Re: [PATCH] expose kvmclock upper msr set.
On 02/25/2011 03:11 PM, Glauber Costa wrote: On Thu, 2011-02-24 at 18:54 -0500, Steven Rostedt wrote: On Thu, 2011-02-24 at 20:48 -0300, Glauber Costa wrote: On Thu, 2011-02-24 at 18:24 -0500, Steven Rostedt wrote: On Wed, Feb 23, 2011 at 12:44:14PM -0500, Glauber Costa wrote: We've been supporting kvmclock MSRs in the 0x4b564d00-0x4b564dff range for a while now, but we're not exposing it yet, meaning nobody is using it. This simple patch takes care of that. Is nobody using it because it was never exposed? Or because nobody ever needed it, and we don't care (thus don't bother supporting it). The former. Our guest kernels will only rely on features that are exposed, meaning that if they are not, the guest kernel will never know it is available. Might want to rephrase your change log, as to me it sounds like nobody is using it because it is not needed. Adding the "Our guest..." part from your response to the change log will clear that up. Thanks, I can do that, sure. OTOH, I know Avi changed changelogs for clarity a couple of times before, so if there is no code change needed, maybe he thinks it is easier to rephrase it before picking it up. Avi? If that's the only change, no problem. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/3] kvm: Allow memory slot array to grow on demand
On 02/24/2011 08:08 PM, Alex Williamson wrote: @@ -207,7 +206,7 @@ struct kvm_mmu_page { * One bit set per slot which has memory * in this shadow page. */ - DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS); + unsigned long *slot_bitmap; What about union { DECLARE_BITMAP(direct_slot_bitmap, BITS_PER_LONG); unsigned long *indirect_slot_bitmap; }; to make the hackery below more explicit? Yeah, it needs something to make the hackery go down easier. I was actually thinking about: unsigned long *slot_bitmap; DECLARE_BITMAP(direct_slot_bitmap, BITS_PER_LONG); Where we'd then just set: slot_bitmap = direct_slot_bitmap; It wastes 8 bytes, and pushes the cache a little harder, but still helps the locality and makes the usage more consistent. unsigned long *sp_slot_bitmap(struct kvm_mmu_page *sp) { ... } gives you the best of both worlds. We don't support failing kvm_mmu_get_page(). See mmu_memory_cache_alloc() and mmu_topup_memory_caches(). Hmm, apparently my search stopped at __direct_map() calling kvm_mmu_get_page() and handling an error. That's dead code (it was there from the very first commit into mmu.c). r = -ENOMEM; - slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL); + + if (mem->slot >= kvm->memslots->nmemslots) { + nmemslots = mem->slot + 1; + flush = true; Isn't flush here a little too aggressive? Shouldn't we flush only if we cross the BITS_PER_LONG threshold? Perhaps, but is that overly exploiting our knowledge about the bitmap implementation? I figured better to error too aggressively than too lazy since this is a rare event already. I'm worried about the screen-clearing using the vga window at 0xa[08]000. If that works without too much flushing, then we're fine. On second thoughts we're likely fine even if we do flush, since it's in a tight loop so it takes very little work to reestablish the dropped sptes.
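The union Avi suggests, together with the sp_slot_bitmap() accessor, can be sketched like this. This is a standalone toy with a made-up struct (not the real kvm_mmu_page, and a plain unsigned long stands in for DECLARE_BITMAP); it only illustrates how the accessor hides the direct/indirect choice from callers:

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG ((int)(CHAR_BIT * sizeof(long)))

/* Hypothetical stand-in for struct kvm_mmu_page: when the slot count
 * fits in one word the bitmap lives inline in the union, otherwise an
 * allocated bitmap is pointed to. */
struct mmu_page {
    int nmemslots;  /* slot count, copied here for the sketch */
    union {
        unsigned long direct_slot_bitmap;     /* nmemslots <= BITS_PER_LONG */
        unsigned long *indirect_slot_bitmap;  /* allocated otherwise */
    };
};

/* The "best of both worlds" accessor: callers never see the union. */
static unsigned long *sp_slot_bitmap(struct mmu_page *sp)
{
    if (sp->nmemslots <= BITS_PER_LONG)
        return &sp->direct_slot_bitmap;
    return sp->indirect_slot_bitmap;
}
```

With this shape, set_bit()/test_bit() call sites stay identical in both cases, which is the consistency Alex was after without the extra 8 bytes.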
@@ -1832,6 +1854,8 @@ static long kvm_vm_ioctl(struct file *filp, sizeof kvm_userspace_mem)) goto out; + kvm_userspace_mem.slot += KVM_PRIVATE_MEM_SLOTS; + Slightly uneasy about this, but no real objection. If you have better ideas, let me know. This reminds me to ask about this chunk: @@ -671,7 +674,7 @@ int __kvm_set_memory_region(struct kvm *kvm, /* Check for overlaps */ r = -EEXIST; - for (i = 0; i < KVM_MEMORY_SLOTS; ++i) { + for (i = KVM_PRIVATE_MEM_SLOTS; i < kvm->memslots->nmemslots; ++i) { struct kvm_memory_slot *s = &kvm->memslots->memslots[i]; if (s == memslot || !s->npages) I kept the same behavior as previous, but it highlights that we're not checking for overlaps between private slots and anything else. Existing bug? Thanks, Yes, possibly serious. Who knows what happens if we create a page using one slot and remove it via another? Let's go write some Visual Basic.
Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable memory slots using wbtree
On 02/24/2011 07:35 PM, Alex Williamson wrote: On Thu, 2011-02-24 at 12:06 +0200, Avi Kivity wrote: On 02/23/2011 09:28 PM, Alex Williamson wrote: I had forgotten about <1M mem, so actually the slot configuration was: 0: <1M, 1: 1M-3.5G, 2: 4G+. I stacked the deck in favor of the static array (0: 4G+, 1: 1M-3.5G, 2: <1M), and got these kernbench results:

         |  base (stdev)   | reorder (stdev) | wbtree (stdev)  |
---------+-----------------+-----------------+-----------------+
Elapsed  |  42.809 (0.19)  |  42.160 (0.22)  |  42.305 (0.23)  |
User     | 115.709 (0.22)  | 114.358 (0.40)  | 114.720 (0.31)  |
System   |  41.605 (0.14)  |  40.741 (0.22)  |  40.924 (0.20)  |
%cpu     |   366.9 (1.45)  |   367.4 (1.17)  |   367.6 (1.51)  |
context  |  7272.3 (68.6)  |  7248.1 (89.7)  |  7249.5 (97.8)  |
sleeps   | 14826.2 (110.6) | 14780.7 (86.9)  | 14798.5 (63.0)  |

So, wbtree is only slightly behind reordering, and the standard deviation suggests the runs are mostly within the noise of each other. Thanks, Doesn't this indicate we should use reordering, instead of a new data structure? The original problem that brought this on was scaling. The re-ordered array still has O(N) scaling while the tree should have ~O(logN) (note that it currently doesn't because it needs a compaction algorithm added after insert and remove). So yes, it's hard to beat the results of a test that hammers on the first couple entries of a sorted array, but I think the tree has better than current performance and more predictable when scaled performance. Scaling doesn't matter, only actual performance. Even a guest with 512 slots would still hammer only on the first few slots, since these will contain the bulk of memory. If we knew when we were searching for which type of data, it would perhaps be nice if we could use a sorted array for guest memory (since it's nicely bounded into a small number of large chunks), and a tree for mmio (where we expect the scaling to be a factor).
Thanks, We have three types of memory:

- RAM - a few large slots
- mapped mmio (for device assignment) - possibly many small slots
- non-mapped mmio (for emulated devices) - no slots

The first two are handled in exactly the same way - they're just memory slots. We expect a lot more hits into the RAM slots, since they're much bigger. But by far the majority of faults will be for the third category - mapped memory will be hit once per page, then handled by hardware until Linux memory management does something about the page, which should hopefully be rare (with device assignment, rare == never, since those pages are pinned). Therefore our optimization priorities should be:

- complete miss into the slot list
- hit into the RAM slots
- hit into the other slots (trailing far behind)

Of course worst-case performance matters. For example, we might (not sure) be searching the list with the mmu spinlock held. I think we still have a bit to go before we can justify the new data structure.
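The priority ordering above can be illustrated with a toy slot lookup; the slot layout, sizes, and names are invented for the example and are not KVM's actual structures:

```c
#include <assert.h>
#include <stddef.h>

/* Toy memslot: guest frame range [base_gfn, base_gfn + npages). */
struct slot { unsigned long base_gfn, npages; };

/* Sorted largest-first: the big RAM slots come before the many small
 * mapped-mmio slots, so the common case (RAM hit) terminates early,
 * while a complete miss (non-mapped mmio) still scans the whole array. */
static const struct slot slots[] = {
    { 0x100,   0x3ff00 },  /* large RAM slot, hit most often */
    { 0x0,     0x100   },  /* <1M legacy RAM */
    { 0x40000, 0x10    },  /* small assigned-device BAR */
};

static const struct slot *gfn_to_slot(unsigned long gfn)
{
    for (size_t i = 0; i < sizeof(slots) / sizeof(slots[0]); i++)
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return &slots[i];
    return NULL;  /* miss: emulated mmio, handled elsewhere */
}
```

This is the "sorted array for guest memory" half of the idea; the point of the thread is that the miss path and the first one or two comparisons dominate, so O(N) over a handful of sorted slots is hard to beat.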
Re: EPT: Misconfiguration
Copying netdev: looks like memory corruption in the networking stack. Archive link: http://www.spinics.net/lists/kvm/msg50651.html (for the attachment). On 02/24/2011 11:15 PM, Ruben Kerkhof wrote: On Tue, Feb 15, 2011 at 18:16, Marcelo Tosatti <mtosa...@redhat.com> wrote: This and the others reported. So yes, it looks like something is corrupting memory. Ruben, you can try to boot with the slub_debug=ZFPU kernel option. Ok, there are now only 6 vms left on this host, and I've booted it with the slub_debug=ZFPU option. After a few hours, I got the following result:
2011-02-24T21:41:30.818496+01:00 phy005 kernel: =
2011-02-24T21:41:30.818517+01:00 phy005 kernel: BUG kmalloc-2048 (Not tainted): Object padding overwritten
2011-02-24T21:41:30.818523+01:00 phy005 kernel: -
2011-02-24T21:41:30.818526+01:00 phy005 kernel:
2011-02-24T21:41:30.818530+01:00 phy005 kernel: INFO: 0x8806230752ca-0x8806230752cf. First byte 0x0 instead of 0x5a
2011-02-24T21:41:30.818534+01:00 phy005 kernel: INFO: Allocated in __netdev_alloc_skb+0x34/0x51 age=2231 cpu=8 pid=0
2011-02-24T21:41:30.818537+01:00 phy005 kernel: INFO: Freed in skb_release_data+0xc9/0xce age=2368 cpu=8 pid=2159
2011-02-24T21:41:30.818541+01:00 phy005 kernel: INFO: Slab 0xea00157a9880 objects=15 used=13 fp=0x8806230752d0 flags=0x404083
2011-02-24T21:41:30.818545+01:00 phy005 kernel: INFO: Object 0x880623074a88 @offset=19080 fp=0x8806230752d0
The rest of the output is attached since it's quite large. Kind regards, Ruben
Re: [PATCH] qemu-kvm: Fix non-PCI target build
On 02/23/2011 10:28 AM, Jan Kiszka wrote: Replace obsolete qemu-kvm.h with kvm.h in pci.c and build that module just like upstream does. This fixes non-x86 targets which have no PCI support. Thanks, applied to mainline and stable-0.14.
Re: [PATCH] expose kvmclock upper msr set.
On 02/23/2011 07:44 PM, Glauber Costa wrote: We've been supporting kvmclock MSRs in the 0x4b564d00-0x4b564dff range for a while now, but we're not exposing it yet, meaning nobody is using it. This simple patch takes care of that. We're exposing them via KVM_GET_SUPPORTED_CPUID, leaf KVM_CPUID_FEATURES.
Re: [RESEND] [PATCH 1/1] qemu-kvm (device-assignment): detecting the pre-fetchable memory region.
On 02/26/2011 06:44 PM, Prasad Joshi wrote: During device assignment the memory pre-fetchable flag was discarded as IORESOURCE_PREFETCH was defined as 0x1000 when instead it should have been 0x2000. The following small patch fixes the problem, please apply. Applied, thanks. Note this patch was somehow corrupted with 0xa0 characters instead of good old spaces. Please check your mail config. Signed-off-by: Prasad Joshi <prasadjoshi...@gmail.com> --- diff --git a/hw/device-assignment.c b/hw/device-assignment.c index e5205cf..2cfa155 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -47,7 +47,7 @@ #define IORESOURCE_MEM 0x0200 #define IORESOURCE_IRQ 0x0400 #define IORESOURCE_DMA 0x0800 -#define IORESOURCE_PREFETCH 0x1000 /* No side effects */ +#define IORESOURCE_PREFETCH 0x2000 /* No side effects */ /* #define DEVICE_ASSIGNMENT_DEBUG 1 */ -- The last line from the diff was missing as well.
Re: [PATCH]: Fixes the qemu-kvm compilation failure when configured with disable-kvm
On 02/26/2011 06:38 PM, Prasad Joshi wrote: I pulled the latest qemu-kvm code and configured it with --disable-kvm and related options. Configuration finishes well, but the compilation fails. This is already fixed in tree.
Re: [PATCH 1/3] kvm-unit-tests: add x86 port io accessors
On 02/24/2011 11:48 PM, Anthony Liguori wrote: Signed-off-by: Anthony Liguori <aligu...@us.ibm.com> diff --git a/lib/x86/io.h b/lib/x86/io.h new file mode 100644 index 000..bd6341c --- /dev/null +++ b/lib/x86/io.h @@ -0,0 +1,40 @@ +#ifndef IO_H +#define IO_H + +static inline unsigned char inb(unsigned short port) +{ +    unsigned char value; +    asm volatile("inb %w1, %0" : "=a" (value) : "Nd" (port)); +    return value; +} Are those %[wb] really needed? gcc should do the right thing based on the argument type. Might as well put all of that into processor.h.
Re: [PATCH 2/3] kvm-unit-tests: do not set level sensitive irq when initializing the PIC
On 02/24/2011 11:48 PM, Anthony Liguori wrote: I'm not sure if this was intentional but the QEMU i8259 does not support this flag. I haven't observed any issues with this but I'll happily admit that I'm not very aware of what I'm doing here. Signed-off-by: Anthony Liguori <aligu...@us.ibm.com> static u32 xapic_read(unsigned reg) { return *(volatile u32 *)(g_apic + reg); @@ -133,7 +129,7 @@ void ioapic_write_redir(unsigned line, ioapic_redir_entry_t e) void enable_apic(void) { printf("enabling apic\n"); -xapic_write(0xf0, 0x1ff); /* spurious vector register */ +xapic_write(0xf0, 0x1f7); /* spurious vector register */ } Not sure what you're doing here. You're changing the APIC Spurious Vector from 0xff to 0xf7? This has nothing to do with the i8259 or level triggeredness as far as I can tell - it just enables the APIC (bit 8) and selects a vector for reporting spurious interrupts (0xff happens to be the reset value).
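For reference, the two values in the hunk decode as Avi describes; a minimal sketch, with macro names of my own invention (not from kvm-unit-tests):

```c
#include <assert.h>
#include <stdint.h>

/* Local APIC spurious-interrupt vector register (xAPIC offset 0xf0):
 * bit 8 is the APIC software-enable bit, bits 0-7 the spurious vector. */
#define APIC_SV_ENABLE     (1u << 8)
#define APIC_SV_VECTOR(x)  ((x) & 0xffu)

static int apic_sw_enabled(uint32_t svr)
{
    return !!(svr & APIC_SV_ENABLE);
}
```

Both 0x1ff and 0x1f7 keep bit 8 set, so the patch leaves the APIC enabled and only moves the spurious vector from 0xff to 0xf7 - which is why the change has no visible connection to the i8259.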
Re: [PATCH 3/3] kvm-unit-tests: make I/O more friendly to existing QEMU hardware
On 02/24/2011 11:48 PM, Anthony Liguori wrote: Use the serial port for printf() and use the Bochs bios exit port if the testdev port isn't available. This unconditionally switches to use the serial port but tries to use the testdev exit port since that lets you pass an exit status. The only issue I can see is that the serial port is much slower than the testdev port (since we can use rep outsb there). That only matters for access.flat in full output mode (not default). So please keep the old code under #if 0 so we can easily switch back to it.
Re: [Qemu-devel] Re: [PATCH 1/3] kvm-unit-tests: add x86 port io accessors
On 02/27/2011 06:44 AM, Avi Kivity wrote: On 02/24/2011 11:48 PM, Anthony Liguori wrote: Signed-off-by: Anthony Liguori <aligu...@us.ibm.com> diff --git a/lib/x86/io.h b/lib/x86/io.h new file mode 100644 index 000..bd6341c --- /dev/null +++ b/lib/x86/io.h @@ -0,0 +1,40 @@ +#ifndef IO_H +#define IO_H + +static inline unsigned char inb(unsigned short port) +{ +    unsigned char value; +    asm volatile("inb %w1, %0" : "=a" (value) : "Nd" (port)); +    return value; +} Are those %[wb] really needed? gcc should do the right thing based on the argument type. It's just a little extra type safety. Might as well put all of that into processor.h. Seems reasonable. Regards, Anthony Liguori
Re: [Qemu-devel] Re: [PATCH 1/3] kvm-unit-tests: add x86 port io accessors
On 02/27/2011 04:01 PM, Anthony Liguori wrote: On 02/27/2011 06:44 AM, Avi Kivity wrote: On 02/24/2011 11:48 PM, Anthony Liguori wrote: Signed-off-by: Anthony Liguori <aligu...@us.ibm.com> diff --git a/lib/x86/io.h b/lib/x86/io.h new file mode 100644 index 000..bd6341c --- /dev/null +++ b/lib/x86/io.h @@ -0,0 +1,40 @@ +#ifndef IO_H +#define IO_H + +static inline unsigned char inb(unsigned short port) +{ +    unsigned char value; +    asm volatile("inb %w1, %0" : "=a" (value) : "Nd" (port)); +    return value; +} Are those %[wb] really needed? gcc should do the right thing based on the argument type. It's just a little extra type safety. Yeah, but those constraints aren't really documented. Linux does use them in a handful of places so they're unlikely to go away though.
Re: [PATCH 1/2] KVM: VMX: write new TR selector value into vmcs immediately if it changes during vm86 mode.
On 02/21/2011 12:07 PM, Gleb Natapov wrote: When vm86 is active TR descriptor is updated with vm86 task values, but selector is left intact. vmx_set_segment() makes sure that if TR register is written into while vm86 is active the new values are saved for use after vm86 is deactivated, but since selector is not updated on vm86 activation/deactivation new value is lost. Fix this by writing new selector into vmcs immediately. Applied, thanks.
Re: [PATCH 2/2] KVM: VMX: Initialize vm86 TSS only once.
On 02/21/2011 12:07 PM, Gleb Natapov wrote: Currently vm86 task is initialized on each real mode entry and vcpu reset. Initialization is done by zeroing TSS and updating relevant fields. But since all vcpus are using the same TSS there is a race where one vcpu may use TSS while other vcpu is initializing it, so the vcpu that uses TSS will see wrong TSS content and will behave incorrectly. Fix that by initializing TSS only once. Applied, thanks. According to my reading of the code, if KVM_SET_TSS_ADDR is not invoked, the guest would fail both before and after the patch, yes?
Re: [PATCH 2/2] KVM: VMX: Initialize vm86 TSS only once.
On Sun, Feb 27, 2011 at 05:43:07PM +0200, Avi Kivity wrote: On 02/21/2011 12:07 PM, Gleb Natapov wrote: Currently vm86 task is initialized on each real mode entry and vcpu reset. Initialization is done by zeroing TSS and updating relevant fields. But since all vcpus are using the same TSS there is a race where one vcpu may use TSS while other vcpu is initializing it, so the vcpu that uses TSS will see wrong TSS content and will behave incorrectly. Fix that by initializing TSS only once. Applied, thanks. According to my reading of the code, if KVM_SET_TSS_ADDR is not invoked, the guest would fail both before and after the patch, yes? Hmmm. Actually no. Before the patch a guest that doesn't use KVM_SET_TSS_ADDR will use the top of slot zero. Should I fix that (how?), or should we drop support for those old guests? The problem with using the top of slot zero is that this memory is available for guest use and we do not even put it into the e820 map as far as I see. Also there are patches floating around that re-arrange memslots or even put them in a tree. They will break old guests too. -- Gleb.
Re: [PATCH 2/2] KVM: VMX: Initialize vm86 TSS only once.
On 02/27/2011 05:52 PM, Gleb Natapov wrote: According to my reading of the code, if KVM_SET_TSS_ADDR is not invoked, the guest would fail both before and after the patch, yes? Hmmm. Actually no. Before the patch guest that doesn't use KVM_SET_TSS_ADDR will use the top of slot zero. Should I fix that (how?), or should we drop support for those old guests? I don't think we have a problem with older qemus, but perhaps we do with non-qemu users. The API clearly requires the ioctl to be called, but I don't think we can blame anyone for forgetting to do so, especially if it worked silently. The problem with using top of slot zero is that this memory is available for guest use and we do not even put it into e820 map as far as I see. Also there are patches floating around that re-arrange memslots or even put them in a tree. They will break old guests too. Well, slot 0 still exists even if it is moved somewhere else. Something we can do is put the tss slot just below the highest slot that is still below 4G, and hope there is no mmio there. Once the user issues KVM_SET_TSS_ADDR, use that. We'll have to keep juggling that slot as the user creates more slots, icky.
Re: [PATCH 2/2] KVM: VMX: Initialize vm86 TSS only once.
On 02/27/2011 05:58 PM, Avi Kivity wrote: The problem with using top of slot zero is that this memory is available for guest use and we do not even put it into e820 map as far as I see. Also there are patches floating around that re-arrange memslots or even put them in a tree. They will break old guests too. Well, slot 0 still exists even if it is moved somewhere else. Something we can do is put the tss slot just below the highest slot that is still below 4G, and hope there is no mmio there. Once the user issues KVM_SET_TSS_ADDR, use that. We'll have to keep juggling that slot as the user creates more slots, icky. Or we can keep the old behaviour. If KVM_SET_TSS_ADDR hasn't been called by the time of the first entry into real mode (the first KVM_CREATE_VCPU?), use the top of the first slot. We can avoid the SMP problem by initializing the memory in a single pass, writing each byte exactly once with its final value. This way concurrent initialization doesn't corrupt an in-use TSS.
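The "single pass" idea can be sketched as below. The offsets and field layout here are illustrative only (not the real vm86 TSS image KVM builds); the point is purely that staging the image locally and copying it out means every guest byte is written exactly once with its final value, so a vcpu racing with the initialization never sees a byte change twice:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TSS_SIZE        104     /* 32-bit TSS size, sans I/O bitmap */
#define TSS_IOPB_OFFSET 0x66    /* I/O map base field */

/* Build the TSS image in a local buffer, then copy it out in one pass. */
static void init_vm86_tss(uint8_t *guest_tss)
{
    uint8_t image[TSS_SIZE] = { 0 };        /* zeroed staging copy */

    /* Point the I/O map base just past the segment limit, so every
     * port access traps (illustrative field, not the full real setup). */
    image[TSS_IOPB_OFFSET]     = TSS_SIZE & 0xff;
    image[TSS_IOPB_OFFSET + 1] = TSS_SIZE >> 8;

    /* Each guest byte is written once, already holding its final value,
     * so concurrent use never observes an intermediate state per byte. */
    memcpy(guest_tss, image, TSS_SIZE);
}
```

Contrast this with the pre-patch flow (zero everything, then update fields), where a racing vcpu could read a byte after the zeroing but before the field update.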
Re: [PATCH 2/2] KVM: VMX: Initialize vm86 TSS only once.
On Sun, Feb 27, 2011 at 05:58:54PM +0200, Avi Kivity wrote: On 02/27/2011 05:52 PM, Gleb Natapov wrote: According to my reading of the code, if KVM_SET_TSS_ADDR is not invoked, the guest would fail both before and after the patch, yes? Hmmm. Actually no. Before the patch guest that doesn't use KVM_SET_TSS_ADDR will use the top of slot zero. Should I fix that (how?), or should we drop support for those old guests? I don't think we have a problem with older qemus, but perhaps we do with non-qemu users. The API clearly requires the ioctl to be called, but I don't think we can blame anyone for forgetting to do so, especially if it worked silently. It may have worked as in no error returned from KVM_RUN, but if userspace does not call KVM_SET_TSS_ADDR kvm silently uses part of guest memory to store its data, which may cause the guest to fail long after it was started. It is true that for that to happen the guest needs to enter protected mode during its lifetime, and not many guests do that usually. The only cases I can think of are during some guest installations and S3 suspend/resume. The problem with using top of slot zero is that this memory is available for guest use and we do not even put it into e820 map as far as I see. Also there are patches floating around that re-arrange memslots or even put them in a tree. They will break old guests too. Well, slot 0 still exists even if it is moved somewhere else. Something we can do is put the tss slot just below the highest slot that is still below 4G, and hope there is no mmio there. Once the user issues KVM_SET_TSS_ADDR, use that. We'll have to keep juggling that slot as the user creates more slots, icky. I have a question about our current placement of the tss addr. Qemu-kvm places it at 4G-16M and the comment says that this is just below the BIOS ROM, but the BIOS ROM takes only the upper 128K. -- Gleb.
Re: [PATCH 2/2] KVM: VMX: Initialize vm86 TSS only once.
On 02/27/2011 06:12 PM, Gleb Natapov wrote: On Sun, Feb 27, 2011 at 05:58:54PM +0200, Avi Kivity wrote: On 02/27/2011 05:52 PM, Gleb Natapov wrote: According to my reading of the code, if KVM_SET_TSS_ADDR is not invoked, the guest would fail both before and after the patch, yes? Hmmm. Actually no. Before the patch guest that doesn't use KVM_SET_TSS_ADDR will use the top of slot zero. Should I fix that (how?), or should we drop support for those old guests? I don't think we have a problem with older qemus, but perhaps we do with non-qemu users. The API clearly requires the ioctl to be called, but I don't think we can blame anyone for forgetting to do so, especially if it worked silently. It may have worked as in no error returned from KVM_RUN, but if userspace does not call to KVM_SET_TSS_ADDR kvm silently uses part of a guest memory to store its data which may cause guest to fail long after it was started. It is true that for that to happen guest needs to enter protected mode during its lifetime and not many guests do that usually. The only cases I can think of are during some guests installation and S3 suspend/resume. Right. I prefer to keep this partially working state if users didn't have a problem with it. The problem with using top of slot zero is that this memory is available for guest use and we do not even put it into e820 map as far as I see. Also there are patches floating around that re-arrange memslots or even put them in a tree. They will break old guests too. Well, slot 0 still exists even if it is moved somewhere else. Something we can do is put the tss slot just below the highest slot that is still below 4G, and hope there is no mmio there. Once the user issues KVM_SET_TSS_ADDR, use that. We'll have to keep juggling that slot as the user creates more slots, icky. I have a question about our current placement of tss addr. Qemu-kvm places it at 4G-16M and comment says that this is just below BIOS ROM, but BIOS ROM takes only upper 128K. 
Are you surprised that a comment is inaccurate? Likely it was moved to make room for larger bioses.
Re: [PATCH] kvm,async_pf: add missing kvm_async_pf_hash_reset()
On 02/21/2011 05:21 AM, Lai Jiangshan wrote: The hash array of async gfns may still contain some leftover gfns after kvm_clear_async_pf_completion_queue() is called; we need to clear them. Applied, thanks.
Re: [PATCH 2/2] KVM: VMX: Initialize vm86 TSS only once.
On Sun, Feb 27, 2011 at 06:04:16PM +0200, Avi Kivity wrote: On 02/27/2011 05:58 PM, Avi Kivity wrote: The problem with using top of slot zero is that this memory is available for guest use and we do not even put it into e820 map as far as I see. Also there are patches floating around that re-arrange memslots or even put them in a tree. They will break old guests too. Well, slot 0 still exists even if it is moved somewhere else. Something we can do is put the tss slot just below the highest slot that is still below 4G, and hope there is no mmio there. Once the user issues KVM_SET_TSS_ADDR, use that. We'll have to keep juggling that slot as the user creates more slots, icky. Or we can keep the old behaviour. If KVM_SET_TSS_ADDR hasn't been called by the time of the first entry into real mode (the first KVM_CREATE_VCPU?), use the top of the first slot. Do we require that KVM_SET_TSS_ADDR is called before first KVM_CREATE_VCPU? We may still break some theoretical userspaces this way. We can avoid the SMP problem by initializing the memory in a single pass, writing each byte exactly once with its final value. This way concurrent initialization doesn't corrupt an in-use TSS. Sounds hackish, but may work. Doing so will make entering pmode much more slow. -- Gleb.
Re: Q: status of kvm/ subdir in qemu-kvm tarball?
On 02/23/2011 04:57 PM, Jan Kiszka wrote: The only situation where you would use kernel headers not provided by qemu itself is when you want to build qemu against an older kernel and omit some features and runtime tests. But this is hardly a good goal to support. I don't think the goal of the #ifdefs was ever some kind of size optimization but just for the sake of avoiding build breakages. Yes. Well, rm -r kvm is a rather minor thing. But the header discussion is important IMO as we continue to see breakages in KVM builds (mostly qemu upstream) due to missing #ifdefs. That means: - we do not test all variants (because it's impractical) We could teach buildbot about it. - people use older headers - too much effort is wasted on fixing distributed problems that can be solved centrally Yes. But on the other hand carrying headers is the Wrong Thing, isn't it? If everyone did that we'd be in a mess of duplication. I'd like not to contribute to that.
Re: [PATCH 2/2] KVM: VMX: Initialize vm86 TSS only once.
On 02/27/2011 06:27 PM, Gleb Natapov wrote: Or we can keep the old behaviour. If KVM_SET_TSS_ADDR hasn't been called by the time of the first entry into real mode (the first KVM_CREATE_VCPU?), use the top of the first slot. Do we require that KVM_SET_TSS_ADDR is called before first KVM_CREATE_VCPU? We may still break some theoretical userspaces this way. The documentation doesn't say, but I think it's unlikely that userspace will start a guest and then configure it. We can avoid the SMP problem by initializing the memory in a single pass, writing each byte exactly once with its final value. This way concurrent initialization doesn't corrupt an in-use TSS. Sounds hackish, but may work. Doing so will make entering pmode much more slow. Why? do it at most once per vcpu.
Re: [PATCH 2/2] KVM: VMX: Initialize vm86 TSS only once.
On Sun, Feb 27, 2011 at 06:31:13PM +0200, Avi Kivity wrote: On 02/27/2011 06:27 PM, Gleb Natapov wrote: Or we can keep the old behaviour. If KVM_SET_TSS_ADDR hasn't been called by the time of the first entry into real mode (the first KVM_CREATE_VCPU?), use the top of the first slot. Do we require that KVM_SET_TSS_ADDR is called before the first KVM_CREATE_VCPU? We may still break some theoretical userspaces this way. The documentation doesn't say, but I think it's unlikely that userspace will start a guest and then configure it. KVM_CREATE_VCPU is done before the guest is started. Starting of a guest happens when KVM_RUN is done for the first time. We can avoid the SMP problem by initializing the memory in a single pass, writing each byte exactly once with its final value. This way concurrent initialization doesn't corrupt an in-use TSS. Sounds hackish, but may work. Doing so will make entering pmode much slower. Why? Do it at most once per vcpu. A reboot may fail since the guest may have used the memory while running, so the memory should be reinitialized on reboot; but since KVM doesn't always know about a reboot, it has to be done at each protected mode entry. -- Gleb.
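The single-pass idea discussed above can be sketched in miniature. This is a hypothetical layout, not KVM's actual vm86 TSS initialization; the offsets and the trailing marker byte are illustrative. The point is only that every destination byte is written exactly once with its final value, so a second, even concurrent, initialization stores identical bytes and cannot corrupt a TSS that is already in use:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch, not KVM's init_rmode_tss(): build the vm86 TSS
 * image in a private buffer, then store it with one pass over the
 * destination.  Every guest byte is written exactly once with its
 * final value, so repeating the initialization is harmless. */
#define VM86_TSS_SIZE 104  /* illustrative size of a minimal 32-bit TSS */

static void init_vm86_tss(uint8_t *guest_mem)
{
    uint8_t image[VM86_TSS_SIZE + 1];

    memset(image, 0, sizeof(image));
    image[100] = VM86_TSS_SIZE & 0xff;   /* I/O map base, illustrative */
    image[101] = VM86_TSS_SIZE >> 8;
    image[VM86_TSS_SIZE] = 0xff;         /* end marker, illustrative */

    /* single pass: each destination byte gets its final value once */
    memcpy(guest_mem, image, sizeof(image));
}
```

Because the copy is idempotent, two vcpus racing through this code both leave the guest memory in the same state, which is the property the thread is after.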
Re: Bug in kvm_set_irq
On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote: Hi, each time I try to use vhost_net, I'm facing a kernel bug. I do a modprobe vhost_net, and start the guest with vhost=on. Following is a trace with kernel 2.6.37, but I had the same problem with 2.6.36 (cf. https://lkml.org/lkml/2010/11/30/29). 2.6.36 had a theoretical race that could explain this, but it should be ok in 2.6.37. The bug only occurs with vhost_net loaded, so I don't know if this is a bug in the kvm module code or in the vhost_net code. It could be a bug in eventfd, which is the interface used by both kvm and vhost_net. Just for fun, you can try 2.6.38 - the eventfd code has been changed a lot in 2.6.38, and if it does not trigger there it's a hint that irqfd is the reason. Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243100] BUG: unable to handle kernel paging request at 2458 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243250] IP: [a041aa8a] kvm_set_irq+0x2a/0x130 [kvm] Could you run markup_oops/ksymoops on this please? As far as I can see kvm_set_irq can only get a wrong kvm pointer, unless there's some general memory corruption, I'd guess. You can also try comparing the irqfd->kvm pointer in kvm_irqfd_assign, irqfd_wakeup and kvm_set_irq in virt/kvm/eventfd.c.
Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243378] PGD 45d363067 PUD 45e77a067 PMD 0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243556] Oops: [#1] SMP Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243692] last sysfs file: /sys/devices/pci:00/:00:0d.0/:05:00.0/:06:00.0/irq Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243777] CPU 0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243820] Modules linked in: vhost_net macvtap macvlan tun powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand fre q_table cpufreq_conservative fuse xt_physdev ip6t_LOG ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp nf_conntrack_ftp nf_connt rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore snd_page_alloc tpm_tis tpm ps mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses sd_mod enclosu re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2 scsi_mod ehci_hcd [last unloaded: scsi_wait_scan] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Pid: 10, comm: kworker/0:1 Not tainted 2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RIP: 0010:[a041aa8a] [a041aa8a] kvm_set_irq+0x2a/0x130 [kvm] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RSP: 0018:88045fc89d30 EFLAGS: 00010246 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RAX: RBX: 001a RCX: 0001 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RDX: RSI: RDI: Feb 23 13:56:19 
ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RBP: R08: 0001 R09: 880856a91e48 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] R10: R11: R12: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] R13: 0001 R14: R15: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] FS: 7f617986c710() GS:88007f80() knlGS: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] CS: 0010 DS: ES: CR0: 8005003b Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] CR2: 2458 CR3: 00045d197000 CR4: 06f0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] DR0: DR1: DR2: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] DR3: DR6: 0ff0 DR7: 0400 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Process kworker/0:1 (pid: 10, threadinfo 88045fc88000, task 88085fc53c30) Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Stack: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] 88045fc89fd8 000119c0 88045fc88010 88085fc53ee8 Feb 23 13:56:19
Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot
I was not aware of the thread. Please cc me directly, or add a keyword I track - timekeeping, TSC.. Hello Zachary, thanks for your time looking at this! That change alone may not bisect well; without further fixes on top of it, you may end up with a hang or stall, which is likely to manifest in a vendor-specific way. I'm not sure I really understand you here, but this change is exactly what I got while bisecting. With later revisions, including this one, 32bit SMP guests don't boot; before it, they do.. Basically there were a few differences in the platform code about how the TSC was dealt with on systems which did not have stable clocks; this brought the logic into one location, but there was a slight change to the logic here. Note very carefully, the logic on SVM is gated by a condition before this change:

	if (unlikely(cpu != vcpu->cpu)) {
-		u64 delta;
-
-		if (check_tsc_unstable()) {
-			/*
-			 * Make sure that the guest sees a monotonically
-			 * increasing TSC.
-			 */
-			delta = vcpu->arch.host_tsc - native_read_tsc();
-			svm->vmcb->control.tsc_offset += delta;
-			if (is_nested(svm))
-				svm->nested.hsave->control.tsc_offset += delta;
-		}
-		vcpu->cpu = cpu;
-		kvm_migrate_timers(vcpu);

So this only happens with a system which reports the TSC as unstable. After the change, KVM itself may report the TSC as unstable:

+	if (unlikely(vcpu->cpu != cpu)) {
+		/* Make sure TSC doesn't go backwards */
+		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
+			native_read_tsc() - vcpu->arch.last_host_tsc;
+		if (tsc_delta < 0)
+			mark_tsc_unstable("KVM discovered backwards TSC");
+		if (check_tsc_unstable())
+			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
+		kvm_migrate_timers(vcpu);
+		vcpu->cpu = cpu;
+	}

If the platform has very small TSC deltas across CPUs, but indicates the TSC is stable, this could result in KVM marking the TSC unstable. If that is the case, this compensation logic will kick in to avoid backwards TSCs.
Note however, that the logic is not perfect; time which passes while not running on any CPU will be erased, as the delta compensation removes not just the backwards jump but any elapsed time from the TSC. In extreme cases, this could result in time appearing to stand still, with guests failing to boot. This was addressed with a later change, which catches up the missing time: commit c285545f813d7b0ce989fd34e42ad1fe785dc65d. Yes, but this change is already included in 2.6.37, so maybe some other fix is needed? If you have some idea what could be changed, I'll gladly test whatever you recommend, but I'm afraid that's all I can do, since this is a bit of rocket science for me, sorry :( nik

Author: Zachary Amsden zams...@redhat.com
Date: Sat Sep 18 14:38:15 2010 -1000

    KVM: x86: TSC catchup mode

    Negate the effects of AN TYM spell while kvm thread is preempted by
    tracking conversion factor to the highest TSC rate and catching the
    TSC up when it has fallen behind the kernel view of time. Note that
    once triggered, we don't turn off catchup mode. A slightly more clever
    version of this is possible, which only does catchup when TSC rate
    drops, and which specifically targets only CPUs with broken TSC, but
    since these all are considered unstable_tsc(), this patch covers all
    necessary cases.

    Signed-off-by: Zachary Amsden zams...@redhat.com
    Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

-- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28. rijna 168, 709 01 Ostrava tel.: +420 596 603 142 fax: +420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz -
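A toy model of the compensation logic discussed in this thread makes both behaviours visible. The names and structure below are hypothetical, not the kernel's: a backwards delta marks the TSC unstable, and once unstable, the offset adjustment also erases legitimately elapsed time, which is exactly the imperfection the later catchup commit addresses:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the migration-time compensation described
 * above (hypothetical names, not kernel code): compare the new host
 * TSC with the last one sampled; a negative delta marks the TSC
 * unstable, and from then on every delta is cancelled out so the
 * guest never observes time going backwards. */
struct vcpu_tsc {
    int64_t last_host_tsc;  /* 0 means "never sampled" */
    int64_t tsc_offset;
    bool unstable;
};

static void migrate_vcpu(struct vcpu_tsc *v, int64_t host_tsc_now)
{
    int64_t delta = v->last_host_tsc ?
                    host_tsc_now - v->last_host_tsc : 0;

    if (delta < 0)
        v->unstable = true;       /* "KVM discovered backwards TSC" */
    if (v->unstable)
        v->tsc_offset += -delta;  /* cancel the jump the guest would see */
    v->last_host_tsc = host_tsc_now;
}
```

Note how, once `unstable` is set, a positive (forward) delta is subtracted too, so wall-clock time spent off-CPU vanishes from the guest's view; that is the "time appearing to stand still" failure mode.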
Re: Q: status of kvm/ subdir in qemu-kvm tarball?
On 2011-02-27 17:28, Avi Kivity wrote: On 02/23/2011 04:57 PM, Jan Kiszka wrote: The only situation for using kernel headers not provided by qemu itself is when you want to build qemu for an older kernel and omit some features and runtime tests. But this is hardly a good goal to support. I don't think the goal of the #ifdefs was ever some kind of size optimization but just for the sake of avoiding build breakages. Yes. Well, rm -r kvm is a rather minor thing. But the header discussion is important IMO as we continue to see breakages in KVM builds (mostly qemu upstream) due to missing #ifdefs. That means: - we do not test all variants (because it's impractical) We could teach buildbot about it. We could, but the test matrix wouldn't be simple. Think of CAPs that depend on other CAPs. I don't think it would be worth the effort. And if we forget to update a test after adding a CAP, the situation won't be different from today. - people use older headers - too much effort is wasted on fixing distributed problems that can be solved centrally Yes. But on the other hand carrying headers is the Wrong Thing, isn't it? Well, there are also different views on this, see e.g. http://kernelnewbies.org/KernelHeaders If everyone did that we'd be in a mess of duplication. I'd like not to contribute to that. Not all distros keep their kernel headers sufficiently recent and/or users forget to update them when updating the kernel (I just noticed that vhost remained off in some builds here due to pre-historic distro headers). As long as we continue adding or changing the kernel ABI, qemu will face these problems. We can't avoid a certain degree of mess, either in form of in-tree header copies or via a continuously increasing number of #ifdefs in our code. I'm really convinced now the former is both more user-friendly and cleaner than the latter. Jan
Re: [Qemu-devel] Re: kvm crashes with spice while loading qxl
On Sat, Feb 26, 2011 at 01:29:01PM +0100, Jan Kiszka wrote: On 2011-02-26 12:43, xming wrote: When trying to start X (and it loads qxl driver) the kvm process just crashes. This is fixed by Gerd's attached patch (taken from rhel repository, don't know why it wasn't pushed to qemu-kvm upstream). I'll send it to kvm list as well (separate email). qemu-kvm 0.14 startup line /usr/bin/kvm -name spaceball,process=spaceball -m 1024 -kernel /boot/bzImage-2.6.37.2-guest -append root=/dev/vda ro -smp 1 -netdev type=tap,id=spaceball0,script=kvm-ifup-brloc,vhost=on -device virtio-net-pci,netdev=spaceball0,mac=00:16:3e:00:08:01 -drive file=/dev/volume01/G-spaceball,if=virtio -vga qxl -spice port=5957,disable-ticketing -monitor telnet:192.168.0.254:10007,server,nowait,nodelay -pidfile /var/run/kvm/spaceball.pid host is running vanilla 2.6.37.1 on amd64. Here is the bt # gdb /usr/bin/qemu-system-x86_64 GNU gdb (Gentoo 7.2 p1) 7.2 Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type show copying and show warranty for details. This GDB was configured as x86_64-pc-linux-gnu. For bug reporting instructions, please see: http://bugs.gentoo.org/... Reading symbols from /usr/bin/qemu-system-x86_64...done. 
(gdb) set args -name spaceball,process=spaceball -m 1024 -kernel /boot/bzImage-2.6.37.2-guest -append root=/dev/vda ro -smp 1 -netdev type=tap,id=spaceball0,script=kvm-ifup-brloc,vhost=on -device virtio-net-pci,netdev=spaceball0,mac=00:16:3e:00:08:01 -drive file=/dev/volume01/G-spaceball,if=virtio -vga qxl -spice port=5957,disable-ticketing -monitor telnet:192.168.0.254:10007,server,nowait,nodelay -pidfile /var/run/kvm/spaceball.pid (gdb) run Starting program: /usr/bin/qemu-system-x86_64 -name spaceball,process=spaceball -m 1024 -kernel /boot/bzImage-2.6.37.2-guest -append root=/dev/vda ro -smp 1 -netdev type=tap,id=spaceball0,script=kvm-ifup-brloc,vhost=on -device virtio-net-pci,netdev=spaceball0,mac=00:16:3e:00:08:01 -drive file=/dev/volume01/G-spaceball,if=virtio -vga qxl -spice port=5957,disable-ticketing -monitor telnet:192.168.0.254:10007,server,nowait,nodelay -pidfile /var/run/kvm/spaceball.pid [Thread debugging using libthread_db enabled] do_spice_init: starting 0.6.0 spice_server_add_interface: SPICE_INTERFACE_KEYBOARD spice_server_add_interface: SPICE_INTERFACE_MOUSE [New Thread 0x74802710 (LWP 30294)] spice_server_add_interface: SPICE_INTERFACE_QXL [New Thread 0x7fffaacae710 (LWP 30295)] red_worker_main: begin handle_dev_destroy_surfaces: handle_dev_destroy_surfaces: handle_dev_input: start [New Thread 0x7fffaa4ad710 (LWP 30298)] [New Thread 0x7fffa9cac710 (LWP 30299)] [New Thread 0x7fffa94ab710 (LWP 30300)] [New Thread 0x7fffa8caa710 (LWP 30301)] [New Thread 0x7fffa3fff710 (LWP 30302)] [New Thread 0x7fffa37fe710 (LWP 30303)] [New Thread 0x7fffa2ffd710 (LWP 30304)] [New Thread 0x7fffa27fc710 (LWP 30305)] [New Thread 0x7fffa1ffb710 (LWP 30306)] [New Thread 0x7fffa17fa710 (LWP 30307)] reds_handle_main_link: reds_show_new_channel: channel 1:0, connected successfully, over Non Secure link reds_main_handle_message: net test: latency 5.636000 ms, bitrate 11027768 bps (10.516899 Mbps) reds_show_new_channel: channel 2:0, connected successfully, over Non Secure 
link red_dispatcher_set_peer: handle_dev_input: connect handle_new_display_channel: jpeg disabled handle_new_display_channel: zlib-over-glz disabled reds_show_new_channel: channel 4:0, connected successfully, over Non Secure link red_dispatcher_set_cursor_peer: handle_dev_input: cursor connect reds_show_new_channel: channel 3:0, connected successfully, over Non Secure link inputs_link: [New Thread 0x7fffa07f8710 (LWP 30312)] [New Thread 0x7fff9fff7710 (LWP 30313)] [New Thread 0x7fff9f7f6710 (LWP 30314)] [New Thread 0x7fff9eff5710 (LWP 30315)] [New Thread 0x7fff9e7f4710 (LWP 30316)] [New Thread 0x7fff9dff3710 (LWP 30317)] [New Thread 0x7fff9d7f2710 (LWP 30318)] qemu-system-x86_64: /var/tmp/portage/app-emulation/qemu-kvm-0.14.0/work/qemu-kvm-0.14.0/qemu-kvm.c:1724: kvm_mutex_unlock: Assertion `!cpu_single_env' failed. Program received signal SIGABRT, Aborted. [Switching to Thread 0x74802710 (LWP 30294)] 0x75daa165 in raise () from /lib/libc.so.6 (gdb) (gdb) (gdb) (gdb) (gdb) bt #0 0x75daa165 in raise () from /lib/libc.so.6 #1 0x75dab580 in abort () from /lib/libc.so.6 #2 0x75da3201 in __assert_fail () from /lib/libc.so.6 #3 0x00436f7e in kvm_mutex_unlock () at /var/tmp/portage/app-emulation/qemu-kvm-0.14.0/work/qemu-kvm-0.14.0/qemu-kvm.c:1724 #4 qemu_mutex_unlock_iothread ()
Re: [Qemu-devel] Re: kvm crashes with spice while loading qxl
On 2011-02-27 20:03, Alon Levy wrote: On Sat, Feb 26, 2011 at 01:29:01PM +0100, Jan Kiszka wrote: On 2011-02-26 12:43, xming wrote: When trying to start X (and it loads qxl driver) the kvm process just crashes. This is fixed by Gerd's attached patch (taken from rhel repository, don't know why it wasn't pushed to qemu-kvm upstream). I'll send it to kvm list as well (separate email). Patch looks OK on first glance, but the changelog is misleading: This was broken for _both_ trees, but upstream didn't detect the bug. My concerns regarding other side effects of juggling with global mutex in spice code remain. Jan
[PATCH] spice/qxl: locking fix for qemu-kvm
From: Gerd Hoffmann kra...@redhat.com

qxl needs to release the qemu lock before calling some libspice functions (and re-acquire it later). In upstream qemu qxl can just use qemu_mutex_{unlock,lock}_iothread. In qemu-kvm this doesn't work; qxl additionally needs to save+restore the cpu_single_env pointer on unlock+lock.

Signed-off-by: Gerd Hoffmann kra...@redhat.com
---
This is commit 15ba0114aa3ddedbeb2519cb0a8755a05f0a1d38 from the rhel repo.
---
 hw/qxl.c           |   37 +++++++++++++++++++++++++++----------
 ui/spice-display.c |   12 ++++++------
 ui/spice-display.h |    6 ++++++
 3 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/hw/qxl.c b/hw/qxl.c
index fe4212b..117f7c8 100644
--- a/hw/qxl.c
+++ b/hw/qxl.c
@@ -125,6 +125,27 @@ static void qxl_reset_memslots(PCIQXLDevice *d);
 static void qxl_reset_surfaces(PCIQXLDevice *d);
 static void qxl_ring_set_dirty(PCIQXLDevice *qxl);
 
+/* qemu-kvm locking ... */
+void qxl_unlock_iothread(SimpleSpiceDisplay *ssd)
+{
+    if (cpu_single_env) {
+        assert(ssd->env == NULL);
+        ssd->env = cpu_single_env;
+        cpu_single_env = NULL;
+    }
+    qemu_mutex_unlock_iothread();
+}
+
+void qxl_lock_iothread(SimpleSpiceDisplay *ssd)
+{
+    qemu_mutex_lock_iothread();
+    if (ssd->env) {
+        assert(cpu_single_env == NULL);
+        cpu_single_env = ssd->env;
+        ssd->env = NULL;
+    }
+}
+
 static inline uint32_t msb_mask(uint32_t val)
 {
     uint32_t mask;
@@ -662,10 +683,10 @@ static void qxl_hard_reset(PCIQXLDevice *d, int loadvm)
     dprint(d, 1, "%s: start%s\n", __FUNCTION__,
            loadvm ? " (loadvm)" : "");
 
-    qemu_mutex_unlock_iothread();
+    qxl_unlock_iothread(&d->ssd);
     d->ssd.worker->reset_cursor(d->ssd.worker);
     d->ssd.worker->reset_image_cache(d->ssd.worker);
-    qemu_mutex_lock_iothread();
+    qxl_lock_iothread(&d->ssd);
 
     qxl_reset_surfaces(d);
     qxl_reset_memslots(d);
@@ -795,9 +816,9 @@ static void qxl_reset_surfaces(PCIQXLDevice *d)
 {
     dprint(d, 1, "%s:\n", __FUNCTION__);
     d->mode = QXL_MODE_UNDEFINED;
-    qemu_mutex_unlock_iothread();
+    qxl_unlock_iothread(&d->ssd);
     d->ssd.worker->destroy_surfaces(d->ssd.worker);
-    qemu_mutex_lock_iothread();
+    qxl_lock_iothread(&d->ssd);
     memset(&d->guest_surfaces.cmds, 0, sizeof(d->guest_surfaces.cmds));
 }
@@ -866,9 +887,9 @@ static void qxl_destroy_primary(PCIQXLDevice *d)
     dprint(d, 1, "%s\n", __FUNCTION__);
     d->mode = QXL_MODE_UNDEFINED;
-    qemu_mutex_unlock_iothread();
+    qxl_unlock_iothread(&d->ssd);
     d->ssd.worker->destroy_primary_surface(d->ssd.worker, 0);
-    qemu_mutex_lock_iothread();
+    qxl_lock_iothread(&d->ssd);
 }
 
 static void qxl_set_mode(PCIQXLDevice *d, int modenr, int loadvm)
@@ -938,10 +959,10 @@ static void ioport_write(void *opaque, uint32_t addr, uint32_t val)
     case QXL_IO_UPDATE_AREA:
     {
         QXLRect update = d->ram->update_area;
-        qemu_mutex_unlock_iothread();
+        qxl_unlock_iothread(&d->ssd);
         d->ssd.worker->update_area(d->ssd.worker, d->ram->update_surface,
                                    &update, NULL, 0, 0);
-        qemu_mutex_lock_iothread();
+        qxl_lock_iothread(&d->ssd);
         break;
     }
     case QXL_IO_NOTIFY_CMD:
diff --git a/ui/spice-display.c b/ui/spice-display.c
index 020b423..defe652 100644
--- a/ui/spice-display.c
+++ b/ui/spice-display.c
@@ -186,18 +186,18 @@ void qemu_spice_create_host_primary(SimpleSpiceDisplay *ssd)
     surface.mem        = (intptr_t)ssd->buf;
     surface.group_id   = MEMSLOT_GROUP_HOST;
 
-    qemu_mutex_unlock_iothread();
+    qxl_unlock_iothread(ssd);
     ssd->worker->create_primary_surface(ssd->worker, 0, &surface);
-    qemu_mutex_lock_iothread();
+    qxl_lock_iothread(ssd);
 }
 
 void qemu_spice_destroy_host_primary(SimpleSpiceDisplay *ssd)
 {
     dprint(1, "%s:\n", __FUNCTION__);
 
-    qemu_mutex_unlock_iothread();
+    qxl_unlock_iothread(ssd);
     ssd->worker->destroy_primary_surface(ssd->worker, 0);
-    qemu_mutex_lock_iothread();
+    qxl_lock_iothread(ssd);
 }
 
 void qemu_spice_vm_change_state_handler(void *opaque, int running, int reason)
@@ -207,9 +207,9 @@ void qemu_spice_vm_change_state_handler(void *opaque, int running, int reason)
     if (running) {
         ssd->worker->start(ssd->worker);
     } else {
-        qemu_mutex_unlock_iothread();
+        qxl_unlock_iothread(ssd);
         ssd->worker->stop(ssd->worker);
-        qemu_mutex_lock_iothread();
+        qxl_lock_iothread(ssd);
     }
     ssd->running = running;
 }
diff --git a/ui/spice-display.h b/ui/spice-display.h
index aef0464..df74828 100644
--- a/ui/spice-display.h
+++ b/ui/spice-display.h
@@ -43,6 +43,9 @@ typedef struct SimpleSpiceDisplay {
     QXLRect dirty;
     int notify;
     int running;
+
+    /* qemu-kvm locking ... */
+    void *env;
 } SimpleSpiceDisplay;
 
 typedef struct SimpleSpiceUpdate {
@@ -52,6 +55,9 @@ typedef struct SimpleSpiceUpdate {
     uint8_t *bitmap;
 } SimpleSpiceUpdate;
 
+void qxl_unlock_iothread(SimpleSpiceDisplay *ssd);
+void qxl_lock_iothread(SimpleSpiceDisplay *ssd);
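The save/restore discipline in the patch can be exercised standalone. The sketch below uses hypothetical types, with plain variables standing in for qemu-kvm's cpu_single_env global and the io mutex; it shows the invariant the asserts in the patch enforce: while the lock is dropped, cpu_single_env is NULL, and the saved value is put back only after the lock is re-taken:

```c
#include <assert.h>
#include <stddef.h>

/* Standalone model (hypothetical names) of the qemu-kvm pattern above:
 * stash cpu_single_env and clear it before dropping the global lock,
 * restore it after re-acquiring the lock. */
typedef struct { void *env; } SpiceDisplayState;

static void *cpu_single_env;    /* stands in for qemu-kvm's global */
static int iothread_locked = 1; /* stands in for the global io mutex */

static void unlock_iothread(SpiceDisplayState *ssd)
{
    if (cpu_single_env) {
        assert(ssd->env == NULL);
        ssd->env = cpu_single_env;  /* save */
        cpu_single_env = NULL;      /* nobody may see a stale value */
    }
    iothread_locked = 0;            /* "qemu_mutex_unlock_iothread()" */
}

static void lock_iothread(SpiceDisplayState *ssd)
{
    iothread_locked = 1;            /* "qemu_mutex_lock_iothread()" */
    if (ssd->env) {
        assert(cpu_single_env == NULL);
        cpu_single_env = ssd->env;  /* restore under the lock */
        ssd->env = NULL;
    }
}
```

The ordering matters: the pointer is cleared before the unlock and restored after the lock, so any other thread that takes the mutex in between observes cpu_single_env == NULL rather than a pointer belonging to a vcpu thread that is currently outside the lock.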
Re: [Qemu-devel] Re: kvm crashes with spice while loading qxl
On Sun, Feb 27, 2011 at 08:11:26PM +0100, Jan Kiszka wrote: On 2011-02-27 20:03, Alon Levy wrote: On Sat, Feb 26, 2011 at 01:29:01PM +0100, Jan Kiszka wrote: On 2011-02-26 12:43, xming wrote: When trying to start X (and it loads qxl driver) the kvm process just crashes. This is fixed by Gerd's attached patch (taken from rhel repository, don't know why it wasn't pushed to qemu-kvm upstream). I'll send it to kvm list as well (separate email). Patch looks OK on first glance, but the changelog is misleading: This was broken for _both_ trees, but upstream didn't detect the bug. The trees the patch commit message refers to are qemu and qemu-kvm. qemu doesn't even have cpu_single_env. It didn't talk about two qemu-kvm trees. My concerns regarding other side effects of juggling with global mutex in spice code remain. I know there used to be a mutex in spice code and during the upstreaming process it got ditched in favor of the qemu global io mutex. I would have rather deferred this to Gerd since he wrote this, but he is not available atm. Jan
Re: [Qemu-devel] Re: kvm crashes with spice while loading qxl
On 2011-02-27 20:16, Alon Levy wrote: On Sun, Feb 27, 2011 at 08:11:26PM +0100, Jan Kiszka wrote: On 2011-02-27 20:03, Alon Levy wrote: On Sat, Feb 26, 2011 at 01:29:01PM +0100, Jan Kiszka wrote: On 2011-02-26 12:43, xming wrote: When trying to start X (and it loads qxl driver) the kvm process just crashes. This is fixed by Gerd's attached patch (taken from rhel repository, don't know why it wasn't pushed to qemu-kvm upstream). I'll send it to kvm list as well (separate email). Patch looks OK on first glance, but the changelog is misleading: This was broken for _both_ trees, but upstream didn't detect the bug. The trees the patch commit message refers to are qemu and qemu-kvm. So did I. qemu doesn't even have cpu_single_env. Really? Check again. :) It didn't talk about two qemu-kvm trees. My concerns regarding other side effects of juggling with global mutex in spice code remain. I know there used to be a mutex in spice code and during the upstreaming process it got ditched in favor of the qemu global io mutex. I would have rather deferred this to Gerd since he wrote this, but he is not available atm. It's not necessarily bad to drop the io mutex, but it is more tricky than it may appear on first glance. Jan
Re: [Qemu-devel] Re: kvm crashes with spice while loading qxl
On Sun, Feb 27, 2011 at 08:27:01PM +0100, Jan Kiszka wrote: On 2011-02-27 20:16, Alon Levy wrote: On Sun, Feb 27, 2011 at 08:11:26PM +0100, Jan Kiszka wrote: On 2011-02-27 20:03, Alon Levy wrote: On Sat, Feb 26, 2011 at 01:29:01PM +0100, Jan Kiszka wrote: On 2011-02-26 12:43, xming wrote: When trying to start X (and it loads qxl driver) the kvm process just crashes. This is fixed by Gerd's attached patch (taken from rhel repository, don't know why it wasn't pushed to qemu-kvm upstream). I'll send it to kvm list as well (separate email). Patch looks OK on first glance, but the changelog is misleading: This was broken for _both_ trees, but upstream didn't detect the bug. The trees the patch commit message refers to are qemu and qemu-kvm. The same did I. qemu doesn't even have cpu_single_env. Really? Check again. :) Sorry, grepped the wrong repo. I'll send this to qemu-devel too then. It didn't talk about two qemu-kvm trees. My concerns regarding other side effects of juggling with global mutex in spice code remain. I know there used to be a mutex in spice code and during the upstreaming process it got ditched in favor of the qemu global io mutex. I would have rather deferred this to Gerd since he wrote this, but he is not available atm. It's not necessarily bad to drop the io mutex, but it is more tricky than it may appear on first glance. Jan
Re: [Qemu-devel] Re: kvm crashes with spice while loading qxl
On Sun, Feb 27, 2011 at 08:27:01PM +0100, Jan Kiszka wrote: On 2011-02-27 20:16, Alon Levy wrote: On Sun, Feb 27, 2011 at 08:11:26PM +0100, Jan Kiszka wrote: On 2011-02-27 20:03, Alon Levy wrote: On Sat, Feb 26, 2011 at 01:29:01PM +0100, Jan Kiszka wrote: On 2011-02-26 12:43, xming wrote: When trying to start X (and it loads qxl driver) the kvm process just crashes. This is fixed by Gerd's attached patch (taken from rhel repository, don't know why it wasn't pushed to qemu-kvm upstream). I'll send it to kvm list as well (separate email). Patch looks OK on first glance, but the changelog is misleading: This was broken for _both_ trees, but upstream didn't detect the bug. The trees the patch commit message refers to are qemu and qemu-kvm. So did I. qemu doesn't even have cpu_single_env. Really? Check again. :) It didn't talk about two qemu-kvm trees. My concerns regarding other side effects of juggling with global mutex in spice code remain. I know there used to be a mutex in spice code and during the upstreaming process it got ditched in favor of the qemu global io mutex. I would have rather deferred this to Gerd since he wrote this, but he is not available atm. It's not necessarily bad to drop the io mutex, but it is more tricky than it may appear on first glance. The problem with not dropping it is that we may be in vga mode and create updates synthetically (i.e. qemu-created and not driver-created) that access the framebuffer and need to be locked so the framebuffer isn't updated at the same time. We drop the mutex only when we are about to call the dispatcher, which basically waits on red_worker (a libspice-server thread) to do some work. red_worker may in turn call back into qxl in qemu, which may try to acquire the lock. (The many may's here are just reflections of the codepaths.)
Jan
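The deadlock Alon describes (blocking on red_worker while holding the global mutex, when red_worker may call back into qxl and take the same mutex) can be modelled in a few lines of C. The names here are hypothetical, and a bool stands in for the non-recursive qemu io mutex:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch (hypothetical names) of the locking rule from the thread:
 * the global io lock must be dropped before blocking on the worker,
 * because the worker may re-enter qemu and take the same lock.
 * A bool stands in for the non-recursive global mutex; taking it
 * twice trips the assert, which models the real deadlock. */
static bool io_lock_held;

static void lock_io(void)   { assert(!io_lock_held); io_lock_held = true; }
static void unlock_io(void) { assert(io_lock_held);  io_lock_held = false; }

/* worker-side callback: re-enters qemu and takes the global lock */
static void worker_callback(void)
{
    lock_io();
    /* ... touch framebuffer / device state under the lock ... */
    unlock_io();
}

/* The caller holds the lock on entry.  Dropping it across the
 * blocking dispatch is what lets worker_callback() run; keeping it
 * held would deadlock (here: fail the non-recursive assert). */
static void dispatch_and_wait(void (*blocking_call)(void))
{
    unlock_io();
    blocking_call();      /* stands in for waiting on red_worker */
    lock_io();
}
```

Running the callback on the same thread is enough to demonstrate the point: if `dispatch_and_wait` did not release the lock first, `worker_callback` would attempt a second acquisition of a non-recursive lock.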
[Bug 30042] New: page allocation failure when using jumbo frames with kvm guests
https://bugzilla.kernel.org/show_bug.cgi?id=30042 Summary: page allocation failure when using jumbo frames with kvm guests Product: Virtualization Version: unspecified Kernel Version: 2.6.35 Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: low Priority: P1 Component: kvm AssignedTo: virtualization_...@kernel-bugs.osdl.org ReportedBy: kernelbug_...@russenmafia.at Regression: No I've got an Intel e1000 based network card in my Ubuntu 10.10 host, using jumbo frames (MTU 9000). In my KVM guest (Ubuntu 10.10) I'm using the virtio network card, also configured with jumbo frames (MTU 9000). netio shows me about 190MB/s, which is almost double the performance I get from the standard MTU size (1500). The only problem I get is when rtorrent starts hash checking: it's awfully slow, CPU at 100% idle, and kernel errors appearing. Feb 27 21:09:48 vmbuntu kernel: [348422.309070] rtorrent: page allocation failure. order:2, mode:0x4020 Feb 27 21:09:48 vmbuntu kernel: [348422.309075] Pid: 984, comm: rtorrent Not tainted 2.6.35-25-virtual #44-Ubuntu Feb 27 21:09:48 vmbuntu kernel: [348422.309078] Call Trace: Feb 27 21:09:48 vmbuntu kernel: [348422.309087] [81107956] __alloc_pages_slowpath+0x4b6/0x5a0 Feb 27 21:09:48 vmbuntu kernel: [348422.309091] [81107bac] __alloc_pages_nodemask+0x16c/0x1d0 Feb 27 21:09:48 vmbuntu kernel: [348422.309096] [8113f742] kmalloc_large_node+0x62/0xb0 Feb 27 21:09:48 vmbuntu kernel: [348422.309100] [8114322c] __kmalloc_node_track_caller+0x11c/0x1d0 Feb 27 21:09:48 vmbuntu kernel: [348422.309105] [814f1f81] ? sk_stream_alloc_skb+0x41/0x110 Feb 27 21:09:48 vmbuntu kernel: [348422.309109] [814a8033] __alloc_skb+0x83/0x170 Feb 27 21:09:48 vmbuntu kernel: [348422.309112] [814f1f81] sk_stream_alloc_skb+0x41/0x110 Feb 27 21:09:48 vmbuntu kernel: [348422.309116] [814f5868] tcp_sendmsg+0x3c8/0xa60 Feb 27 21:09:48 vmbuntu kernel: [348422.309121] [81071096] ?
mod_timer+0x176/0x2d0 Feb 27 21:09:48 vmbuntu kernel: [348422.309126] [814a07d3] sock_sendmsg+0xf3/0x120 Feb 27 21:09:48 vmbuntu kernel: [348422.309129] [814f5c28] ? tcp_sendmsg+0x788/0xa60 Feb 27 21:09:48 vmbuntu kernel: [348422.309134] [812975f6] ? blkcipher_walk_next+0x1b6/0x3e0 Feb 27 21:09:48 vmbuntu kernel: [348422.309137] [814a07d3] ? sock_sendmsg+0xf3/0x120 Feb 27 21:09:48 vmbuntu kernel: [348422.309141] [814a0841] kernel_sendmsg+0x41/0x60 Feb 27 21:09:48 vmbuntu kernel: [348422.309160] [a0052bee] xs_send_kvec+0x8e/0xa0 [sunrpc] Feb 27 21:09:48 vmbuntu kernel: [348422.309177] [a01158be] ? reserve_space+0xe/0x20 [nfs] Feb 27 21:09:48 vmbuntu kernel: [348422.309185] [a0052db2] xs_sendpages+0x1b2/0x210 [sunrpc] Feb 27 21:09:48 vmbuntu kernel: [348422.309194] [a0052f4d] xs_tcp_send_request+0x5d/0x170 [sunrpc] Feb 27 21:09:48 vmbuntu kernel: [348422.309202] [a0050fb3] xprt_transmit+0x83/0x2f0 [sunrpc] Feb 27 21:09:48 vmbuntu kernel: [348422.309210] [a004e118] call_transmit+0xa8/0x130 [sunrpc] Feb 27 21:09:48 vmbuntu kernel: [348422.309219] [a00564a2] __rpc_execute+0xa2/0x290 [sunrpc] Feb 27 21:09:48 vmbuntu kernel: [348422.309227] [a0056713] rpc_execute+0x83/0xa0 [sunrpc] Feb 27 21:09:48 vmbuntu kernel: [348422.309235] [a004ec99] rpc_run_task+0x29/0x40 [sunrpc] Feb 27 21:09:48 vmbuntu kernel: [348422.309244] [a010198f] nfs_read_rpcsetup+0x18f/0x1f0 [nfs] Feb 27 21:09:48 vmbuntu kernel: [348422.309254] [a01027b0] ? readpage_async_filler+0x0/0x160 [nfs] Feb 27 21:09:48 vmbuntu kernel: [348422.309264] [a0101dd9] nfs_pagein_one+0xa9/0xf0 [nfs] Feb 27 21:09:48 vmbuntu kernel: [348422.309272] [a00ff6c7] nfs_pageio_doio+0x37/0x80 [nfs] Feb 27 21:09:48 vmbuntu kernel: [348422.309281] [a00ff764] nfs_pageio_add_request+0x54/0x100 [nfs] Feb 27 21:09:48 vmbuntu kernel: [348422.309296] [a0102838] readpage_async_filler+0x88/0x160 [nfs] Feb 27 21:09:48 vmbuntu kernel: [348422.309306] [a01027b0] ? 
readpage_async_filler+0x0/0x160 [nfs] Feb 27 21:09:48 vmbuntu kernel: [348422.309311] [8110b49a] read_cache_pages+0xba/0x130 Feb 27 21:09:48 vmbuntu kernel: [348422.309425] [a0101676] nfs_readpages+0x146/0x210 [nfs] Feb 27 21:09:48 vmbuntu kernel: [348422.309435] [a0101d30] ? nfs_pagein_one+0x0/0xf0 [nfs] Feb 27 21:09:48 vmbuntu kernel: [348422.309439] [8110b1db] __do_page_cache_readahead+0x13b/0x210 Feb 27 21:09:48 vmbuntu kernel: [348422.309443] [8110b351] force_page_cache_readahead+0x71/0xa0 Feb 27 21:09:48 vmbuntu kernel: [348422.309448] [8111b28b] madvise_vma+0xcb/0x210 Feb 27 21:09:48 vmbuntu kernel: [348422.309452] [811265f5] ? do_mmap_pgoff+0x335/0x380 Feb 27
Re: [PATCH 2/4] KVM: Add kvm_io_ext_data to IO handler
On Friday 25 February 2011 16:12:30 Michael S. Tsirkin wrote: On Fri, Feb 25, 2011 at 11:23:30AM +0800, Sheng Yang wrote: On Thursday 24 February 2011 18:22:19 Michael S. Tsirkin wrote: On Thu, Feb 24, 2011 at 05:51:03PM +0800, Sheng Yang wrote: Add a new parameter to IO writing handler, so that we can transfer information from IO handler to caller. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/i8254.c |6 -- arch/x86/kvm/i8259.c |3 ++- arch/x86/kvm/lapic.c |3 ++- arch/x86/kvm/x86.c| 13 - include/linux/kvm_host.h | 12 ++-- virt/kvm/coalesced_mmio.c |3 ++- virt/kvm/eventfd.c|2 +- virt/kvm/ioapic.c |2 +- virt/kvm/iodev.h |6 -- virt/kvm/kvm_main.c |4 ++-- 10 files changed, 36 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c index efad723..bd8f0c5 100644 --- a/arch/x86/kvm/i8254.c +++ b/arch/x86/kvm/i8254.c @@ -439,7 +439,8 @@ static inline int pit_in_range(gpa_t addr) } static int pit_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = dev_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; @@ -585,7 +586,8 @@ static int pit_ioport_read(struct kvm_io_device *this, } static int speaker_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = speaker_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 3cece05..96b1070 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -480,7 +480,8 @@ static inline struct kvm_pic *to_pic(struct kvm_io_device *dev) } static int picdev_write(struct kvm_io_device *this, -gpa_t addr, int len, const void *val) +gpa_t addr, int len, const void *val, +struct kvm_io_ext_data *ext_data) { struct kvm_pic *s = to_pic(this); unsigned char 
data = *(unsigned char *)val; diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 93cf9d0..f413e9c 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -836,7 +836,8 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) } static int apic_mmio_write(struct kvm_io_device *this, - gpa_t address, int len, const void *data) + gpa_t address, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_lapic *apic = to_lapic(this); unsigned int offset = address - apic-base_address; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..21b84e2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3571,13 +3571,14 @@ static void kvm_init_msr_list(void) } static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len, - const void *v) + const void *v, struct kvm_io_ext_data *ext_data) { if (vcpu-arch.apic - !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v)) + !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v, ext_data)) return 0; - return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, addr, len, v); + return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, + addr, len, v, ext_data); } static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v) @@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu) { gpa_t gpa; + struct kvm_io_ext_data ext_data; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3825,7 +3827,7 @@ mmio: /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa,
Re: [PATCH 3/4] KVM: Emulate MSI-X table in kernel
On Friday 25 February 2011 16:29:38 Michael S. Tsirkin wrote: On Fri, Feb 25, 2011 at 02:28:02PM +0800, Sheng Yang wrote: On Thursday 24 February 2011 18:45:08 Michael S. Tsirkin wrote: On Thu, Feb 24, 2011 at 05:51:04PM +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Signed-off-by: Sheng Yang sh...@linux.intel.com Doesn't look like all comments got addressed. E.g. gpa_t entry_base is still there and in reality you said it's a host virtual address so should be void __user *; Would update it. And ENOTSYNC meaning 'MSIX' is pretty hacky. I'd like to discuss it later. We may need some work on all MMIO handling side to make it more straightforward. But I don't want to bundle it with this one... It's not PCI related so I'll defer to Avi/Marcelo on this. Are you guys happy with the ENOTSYNC meaning 'MSIX' and userspace_exit_needed hacks in this code? --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/Makefile |2 +- arch/x86/kvm/mmu.c |2 + arch/x86/kvm/x86.c | 40 - include/linux/kvm.h | 28 include/linux/kvm_host.h| 34 + virt/kvm/assigned-dev.c | 44 ++ virt/kvm/kvm_main.c | 38 +- virt/kvm/msix_mmio.c| 296 +++ virt/kvm/msix_mmio.h | 25 10 files changed, 497 insertions(+), 13 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index aa75f21..4a390a4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -635,6 +635,7 @@ enum emulation_result { EMULATE_DONE, /* no further processing */ EMULATE_DO_MMIO, /* kvm_run filled with mmio request */ EMULATE_FAIL, /* can't emulate this instruction */ + EMULATE_USERSPACE_EXIT, /* we need exit to userspace */ }; #define EMULTYPE_NO_DECODE (1 0) diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API)+= $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9cafbb4..912dca4 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3358,6 +3358,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code, case EMULATE_DO_MMIO: ++vcpu-stat.mmio_exits; /* fall through */ + case EMULATE_USERSPACE_EXIT: + /* fall through */ case EMULATE_FAIL: return 0; default: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 21b84e2..87308eb 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3809,6 +3810,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, { gpa_t gpa; struct kvm_io_ext_data ext_data; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3824,18 +3826,32 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data); /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data)) + if (!r) return X86EMUL_CONTINUE; - vcpu-mmio_needed = 1; - vcpu-run-exit_reason = KVM_EXIT_MMIO; - vcpu-run-mmio.phys_addr = vcpu-mmio_phys_addr = gpa; - vcpu-run-mmio.len = vcpu-mmio_size = bytes; -
[PATCH 0/3] [RFC] Implement multiqueue (RX & TX) virtio-net
This patch series is a continuation of an earlier one that implemented guest MQ TX functionality. This new patchset implements both RX and TX MQ. Qemu changes are not being included at this time solely to aid in easier review. Compatibility testing with old/new combinations of qemu/guest and vhost was done without any issues. Some early TCP/UDP test results are at the bottom of this post; I plan to submit more test results in the coming days. Please review and provide feedback on what can improve. Thanks!

Signed-off-by: Krishna Kumar krkum...@in.ibm.com
---
Test configuration:
  Host:  8 Intel Xeon, 8 GB memory
  Guest: 4 cpus, 2 GB memory

Each test case runs for 60 secs; results below are the average over two runs.
Bandwidth numbers are in gbps. I have used default netperf, and no
testing/system tuning other than taskset each vhost to 0xf (cpus 0-3).
Comparison is testing the original kernel vs the new kernel with #txqs=8
(# refers to the number of netperf sessions).

______________________________________________________________________________
TCP: Guest -> Local Host (TCP_STREAM)
#     BW1     BW2     (%)       SD1     SD2     (%)       RSD1    RSD2   (%)
______________________________________________________________________________
1     7190    7170    (-.2)     0       0       (0)       3       4      (33.3)
2     8774    11235   (28.0)    3       3       (0)       16      14     (-12.5)
4     9753    15195   (55.7)    17      21      (23.5)    65      59     (-9.2)
8     10224   18265   (78.6)    71      115     (61.9)    251     240    (-4.3)
16    10749   18123   (68.6)    277     456     (64.6)    985     925    (-6.0)
32    11133   17351   (55.8)    1132    1947    (71.9)    3935    3831   (-2.6)
64    11223   17115   (52.4)    4682    7836    (67.3)    15949   15373  (-3.6)
128   11269   16986   (50.7)    19783   31505   (59.2)    66799   61759  (-7.5)
______________________________________________________________________________
Summary:      BW: 37.6          SD: 61.2                  RSD: -6.5

______________________________________________________________________________
TCP: Local Host -> Guest (TCP_MAERTS)
#     BW1     BW2     (%)       SD1     SD2     (%)       RSD1    RSD2   (%)
______________________________________________________________________________
1     11490   10870   (-5.3)    0       0       (0)       2       2      (0)
2     10612   10554   (-.5)     2       3       (50.0)    12      12     (0)
4     10047   14320   (42.5)    13      16      (23.0)    53      53     (0)
8     9273    15182   (63.7)    56      84      (50.0)    228     233    (2.1)
16    9291    15853   (70.6)    235     390     (65.9)    934     965    (3.3)
32    9382    15741   (67.7)    969     1823    (88.1)    3868    4037   (4.3)
64    9270    14585   (57.3)    3966    8836    (122.7)   15415   17818  (15.5)
128   8997    14678   (63.1)    17024   36649   (115.2)   64933   72677  (11.9)
______________________________________________________________________________
Summary:      BW: 24.8          SD: 114.6                 RSD: 12.1

______________________________________________________
UDP: Local Host -> Guest (UDP_STREAM)
#     BW1     BW2     (%)       SD1     SD2     (%)
______________________________________________________
1     17236   16585   (-3.7)    1       1       (0)
2     16795   22693   (35.1)    5       6       (20.0)
4     13390   21502   (60.5)    37      36      (-2.7)
8     13261   24361   (83.7)    163     175     (7.3)
16    12772   23796   (86.3)    692     826     (19.3)
32    12832   23880   (86.0)    2812    2871    (2.0)
64    12779   24293   (90.1)    11299   11237   (-.5)
128   13006   24857   (91.1)    44778   43884   (-1.9)
______________________________________________________
Summary:      BW: 37.1          SD: -1.2
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] [RFC] Change virtqueue structure
Move queue_index from virtio_pci_vq_info to virtqueue. This allows callback handlers to figure out the queue number for the vq that needs attention.

Signed-off-by: Krishna Kumar krkum...@in.ibm.com
---
 drivers/virtio/virtio_pci.c | 10 +++---
 include/linux/virtio.h      |  1 +
 2 files changed, 4 insertions(+), 7 deletions(-)

diff -ruNp org/include/linux/virtio.h new/include/linux/virtio.h
--- org/include/linux/virtio.h	2010-10-11 10:20:22.0 +0530
+++ new/include/linux/virtio.h	2011-02-23 16:26:18.0 +0530
@@ -22,6 +22,7 @@ struct virtqueue {
 	void (*callback)(struct virtqueue *vq);
 	const char *name;
 	struct virtio_device *vdev;
+	int queue_index;	/* the index of the queue */
 	void *priv;
 };
diff -ruNp org/drivers/virtio/virtio_pci.c new/drivers/virtio/virtio_pci.c
--- org/drivers/virtio/virtio_pci.c	2011-01-28 11:38:24.0 +0530
+++ new/drivers/virtio/virtio_pci.c	2011-02-25 10:11:22.0 +0530
@@ -75,9 +75,6 @@ struct virtio_pci_vq_info
 	/* the number of entries in the queue */
 	int num;
 
-	/* the index of the queue */
-	int queue_index;
-
 	/* the virtual address of the ring queue */
 	void *queue;
 
@@ -180,11 +177,10 @@ static void vp_reset(struct virtio_devic
 static void vp_notify(struct virtqueue *vq)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-	struct virtio_pci_vq_info *info = vq->priv;
 
 	/* we write the queue's selector into the notification register to
 	 * signal the other end */
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
+	iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }
 
 /* Handle a configuration change: Tell driver if it wants to know. */
@@ -380,7 +376,6 @@ static struct virtqueue *setup_vq(struct
 	if (!info)
 		return ERR_PTR(-ENOMEM);
 
-	info->queue_index = index;
 	info->num = num;
 	info->msix_vector = msix_vec;
 
@@ -403,6 +398,7 @@ static struct virtqueue *setup_vq(struct
 		goto out_activate_queue;
 	}
 
+	vq->queue_index = index;
 	vq->priv = info;
 	info->vq = vq;
 
@@ -441,7 +437,7 @@ static void vp_del_vq(struct virtqueue *
 	list_del(&info->node);
 	spin_unlock_irqrestore(&vp_dev->lock, flags);
 
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
+	iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
 	if (vp_dev->msix_enabled) {
 		iowrite16(VIRTIO_MSI_NO_VECTOR,
--
[PATCH 2/3] [RFC] Changes for MQ virtio-net
Implement mq virtio-net driver. Though struct virtio_net_config changes, it works with the old qemu since the last element is not accessed unless qemu sets VIRTIO_NET_F_NUMTXQS. Patch also adds a macro for the maximum number of TX vq's (VIRTIO_MAX_TXQS) that the user can specify. Signed-off-by: Krishna Kumar krkum...@in.ibm.com --- drivers/net/virtio_net.c | 543 --- include/linux/virtio_net.h |6 2 files changed, 386 insertions(+), 163 deletions(-) diff -ruNp org/include/linux/virtio_net.h new/include/linux/virtio_net.h --- org/include/linux/virtio_net.h 2010-10-11 10:20:22.0 +0530 +++ new/include/linux/virtio_net.h 2011-02-25 16:24:15.0 +0530 @@ -7,6 +7,9 @@ #include linux/virtio_config.h #include linux/if_ether.h +/* Maximum number of individual RX/TX queues supported */ +#define VIRTIO_MAX_TXQS16 + /* The feature bitmap for virtio net */ #define VIRTIO_NET_F_CSUM 0 /* Host handles pkts w/ partial csum */ #define VIRTIO_NET_F_GUEST_CSUM1 /* Guest handles pkts w/ partial csum */ @@ -26,6 +29,7 @@ #define VIRTIO_NET_F_CTRL_RX 18 /* Control channel RX mode support */ #define VIRTIO_NET_F_CTRL_VLAN 19 /* Control channel VLAN filtering */ #define VIRTIO_NET_F_CTRL_RX_EXTRA 20 /* Extra RX mode control support */ +#define VIRTIO_NET_F_NUMTXQS 21 /* Device supports multiple TX queue */ #define VIRTIO_NET_S_LINK_UP 1 /* Link is up */ @@ -34,6 +38,8 @@ struct virtio_net_config { __u8 mac[6]; /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */ __u16 status; + /* number of RX/TX queues */ + __u16 numtxqs; } __attribute__((packed)); /* This is the first element of the scatter-gather list. 
If you don't diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c --- org/drivers/net/virtio_net.c2011-02-21 17:55:42.0 +0530 +++ new/drivers/net/virtio_net.c2011-02-25 16:23:41.0 +0530 @@ -40,31 +40,53 @@ module_param(gso, bool, 0444); #define VIRTNET_SEND_COMMAND_SG_MAX2 -struct virtnet_info { - struct virtio_device *vdev; - struct virtqueue *rvq, *svq, *cvq; - struct net_device *dev; +/* Internal representation of a send virtqueue */ +struct send_queue { + struct virtqueue *svq; + + /* TX: fragments + linear part + virtio header */ + struct scatterlist tx_sg[MAX_SKB_FRAGS + 2]; +}; + +/* Internal representation of a receive virtqueue */ +struct receive_queue { + /* Virtqueue associated with this receive_queue */ + struct virtqueue *rvq; + + /* Back pointer to the virtnet_info */ + struct virtnet_info *vi; + struct napi_struct napi; - unsigned int status; /* Number of input buffers, and max we've ever had. */ unsigned int num, max; - /* I like... big packets and I cannot lie! */ - bool big_packets; - - /* Host will merge rx buffers for big packets (shake it! shake it!) */ - bool mergeable_rx_bufs; - /* Work struct for refilling if we run low on memory. */ struct delayed_work refill; /* Chain pages by the private ptr. */ struct page *pages; - /* fragments + linear part + virtio header */ + /* RX: fragments + linear part + virtio header */ struct scatterlist rx_sg[MAX_SKB_FRAGS + 2]; - struct scatterlist tx_sg[MAX_SKB_FRAGS + 2]; +}; + +struct virtnet_info { + struct send_queue **sq; + struct receive_queue **rq; + + /* read-mostly variables */ + int numtxqs cacheline_aligned_in_smp; /* # of rxqs/txqs */ + struct virtio_device *vdev; + struct virtqueue *cvq; + struct net_device *dev; + unsigned int status; + + /* I like... big packets and I cannot lie! */ + bool big_packets; + + /* Host will merge rx buffers for big packets (shake it! shake it!) 
*/ + bool mergeable_rx_bufs; }; struct skb_vnet_hdr { @@ -94,22 +116,22 @@ static inline struct skb_vnet_hdr *skb_v * private is used to chain pages for big packets, put the whole * most recent used list in the beginning for reuse */ -static void give_pages(struct virtnet_info *vi, struct page *page) +static void give_pages(struct receive_queue *rq, struct page *page) { struct page *end; /* Find end of list, sew whole thing into vi-pages. */ for (end = page; end-private; end = (struct page *)end-private); - end-private = (unsigned long)vi-pages; - vi-pages = page; + end-private = (unsigned long)rq-pages; + rq-pages = page; } -static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask) +static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask) { - struct page *p = vi-pages; + struct page *p = rq-pages; if (p) { - vi-pages = (struct page *)p-private; +
[PATCH 3/3] [RFC] Changes for MQ vhost
Changes for mq vhost. vhost_net_open is changed to allocate a vhost_net and return. The remaining initializations are delayed till SET_OWNER. SET_OWNER is changed so that the argument is used to determine how many txqs to use. Unmodified qemu's will pass NULL, so this is recognized and handled as numtxqs=1. The number of vhost threads is = #txqs. Threads handle more than one txq when #txqs is more than MAX_VHOST_THREADS (4). The same thread handles both RX and TX - tested with tap/bridge so far (TBD: needs some changes in macvtap driver to support the same). I had to convert space-tab in vhost_attach_cgroups* to avoid checkpatch errors. Signed-off-by: Krishna Kumar krkum...@in.ibm.com --- drivers/vhost/net.c | 295 ++-- drivers/vhost/vhost.c | 225 +++--- drivers/vhost/vhost.h | 39 - 3 files changed, 378 insertions(+), 181 deletions(-) diff -ruNp org/drivers/vhost/vhost.h new/drivers/vhost/vhost.h --- org/drivers/vhost/vhost.h 2011-02-08 09:05:09.0 +0530 +++ new/drivers/vhost/vhost.h 2011-02-28 11:48:06.0 +0530 @@ -35,11 +35,11 @@ struct vhost_poll { wait_queue_t wait; struct vhost_work work; unsigned long mask; - struct vhost_dev *dev; + struct vhost_virtqueue*vq; /* points back to vq */ }; void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, -unsigned long mask, struct vhost_dev *dev); +unsigned long mask, struct vhost_virtqueue *vq); void vhost_poll_start(struct vhost_poll *poll, struct file *file); void vhost_poll_stop(struct vhost_poll *poll); void vhost_poll_flush(struct vhost_poll *poll); @@ -108,8 +108,14 @@ struct vhost_virtqueue { /* Log write descriptors */ void __user *log_base; struct vhost_log *log; + struct task_struct *worker; /* worker for this vq */ + spinlock_t *work_lock; /* points to a dev-work_lock[] entry */ + struct list_head *work_list;/* points to a dev-work_list[] entry */ + int qnum; /* 0 for RX, 1 - n-1 for TX */ }; +#define MAX_VHOST_THREADS 4 + struct vhost_dev { /* Readers use RCU to access memory table pointer * log base 
pointer and features. @@ -122,12 +128,33 @@ struct vhost_dev { int nvqs; struct file *log_file; struct eventfd_ctx *log_ctx; - spinlock_t work_lock; - struct list_head work_list; - struct task_struct *worker; + spinlock_t *work_lock[MAX_VHOST_THREADS]; + struct list_head *work_list[MAX_VHOST_THREADS]; }; -long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs); +/* + * Return maximum number of vhost threads needed to handle RX TX. + * Upto MAX_VHOST_THREADS are started, and threads can be shared + * among different vq's if numtxqs MAX_VHOST_THREADS. + */ +static inline int get_nvhosts(int nvqs) +{ + return min_t(int, nvqs / 2, MAX_VHOST_THREADS); +} + +/* + * Get index of an existing thread that will handle this txq/rxq. + * The same thread handles both rx[index] and tx[index]. + */ +static inline int vhost_get_thread_index(int index, int numtxqs, int nvhosts) +{ + return (index % numtxqs) % nvhosts; +} + +int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs); +void vhost_free_vqs(struct vhost_dev *dev); +long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs, + int nvhosts); long vhost_dev_check_owner(struct vhost_dev *); long vhost_dev_reset_owner(struct vhost_dev *); void vhost_dev_cleanup(struct vhost_dev *); diff -ruNp org/drivers/vhost/net.c new/drivers/vhost/net.c --- org/drivers/vhost/net.c 2011-02-08 09:05:09.0 +0530 +++ new/drivers/vhost/net.c 2011-02-28 11:48:53.0 +0530 @@ -32,12 +32,6 @@ * Using this limit prevents one virtqueue from starving others. 
*/ #define VHOST_NET_WEIGHT 0x8 -enum { - VHOST_NET_VQ_RX = 0, - VHOST_NET_VQ_TX = 1, - VHOST_NET_VQ_MAX = 2, -}; - enum vhost_net_poll_state { VHOST_NET_POLL_DISABLED = 0, VHOST_NET_POLL_STARTED = 1, @@ -46,12 +40,13 @@ enum vhost_net_poll_state { struct vhost_net { struct vhost_dev dev; - struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX]; - struct vhost_poll poll[VHOST_NET_VQ_MAX]; + struct vhost_virtqueue *vqs; + struct vhost_poll *poll; + struct socket **socks; /* Tells us whether we are polling a socket for TX. * We only do this when socket buffer fills up. * Protected by tx vq lock. */ - enum vhost_net_poll_state tx_poll_state; + enum vhost_net_poll_state *tx_poll_state; }; /* Pop first len bytes from iovec. Return number of segments used. */ @@ -91,28 +86,28 @@ static void copy_iovec_hdr(const struct } /* Caller must have TX VQ lock */ -static void tx_poll_stop(struct vhost_net *net)
[PATCH 2/4] KVM: Add kvm_io_ext_data to IO handler
Add a new parameter to IO writing handler, so that we can transfer information from IO handler to caller. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/i8254.c |6 -- arch/x86/kvm/i8259.c |3 ++- arch/x86/kvm/lapic.c |3 ++- arch/x86/kvm/x86.c| 13 - include/linux/kvm_host.h |9 +++-- virt/kvm/coalesced_mmio.c |3 ++- virt/kvm/eventfd.c|2 +- virt/kvm/ioapic.c |2 +- virt/kvm/iodev.h |6 -- virt/kvm/kvm_main.c |4 ++-- 10 files changed, 33 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c index efad723..bd8f0c5 100644 --- a/arch/x86/kvm/i8254.c +++ b/arch/x86/kvm/i8254.c @@ -439,7 +439,8 @@ static inline int pit_in_range(gpa_t addr) } static int pit_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = dev_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; @@ -585,7 +586,8 @@ static int pit_ioport_read(struct kvm_io_device *this, } static int speaker_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = speaker_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 3cece05..96b1070 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -480,7 +480,8 @@ static inline struct kvm_pic *to_pic(struct kvm_io_device *dev) } static int picdev_write(struct kvm_io_device *this, -gpa_t addr, int len, const void *val) +gpa_t addr, int len, const void *val, +struct kvm_io_ext_data *ext_data) { struct kvm_pic *s = to_pic(this); unsigned char data = *(unsigned char *)val; diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 93cf9d0..f413e9c 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -836,7 +836,8 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, 
u32 val) } static int apic_mmio_write(struct kvm_io_device *this, - gpa_t address, int len, const void *data) + gpa_t address, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_lapic *apic = to_lapic(this); unsigned int offset = address - apic-base_address; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..21b84e2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3571,13 +3571,14 @@ static void kvm_init_msr_list(void) } static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len, - const void *v) + const void *v, struct kvm_io_ext_data *ext_data) { if (vcpu-arch.apic - !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v)) + !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v, ext_data)) return 0; - return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, addr, len, v); + return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, + addr, len, v, ext_data); } static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v) @@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu) { gpa_t gpa; + struct kvm_io_ext_data ext_data; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3825,7 +3827,7 @@ mmio: /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val)) + if (!vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data)) return X86EMUL_CONTINUE; vcpu-mmio_needed = 1; @@ -3940,6 +3942,7 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd) { /* TODO: String I/O for in kernel device */ int r; + struct kvm_io_ext_data ext_data; if (vcpu-arch.pio.in) r = kvm_io_bus_read(vcpu-kvm, KVM_PIO_BUS, vcpu-arch.pio.port, @@ -3947,7 +3950,7 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd) else r = kvm_io_bus_write(vcpu-kvm, KVM_PIO_BUS, vcpu-arch.pio.port, vcpu-arch.pio.size, -pd); +pd, ext_data); return r; } diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 7d313e0..a32c53e 100644 ---
[PATCH 0/4 v11] MSI-X MMIO support for KVM
Change from v9: updated according to the comments of Alex and Michael. Notice this patchset is still based on 2.6.37 due to a blocking bug for assigned devices in the current upstream.

Sheng Yang (4):
  KVM: Move struct kvm_io_device to kvm_host.h
  KVM: Add kvm_io_ext_data to IO handler
  KVM: Emulate MSI-X table in kernel
  KVM: Add documents for MSI-X MMIO API

 Documentation/kvm/api.txt       |  58 ++++
 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/kvm/Makefile           |   2 +-
 arch/x86/kvm/i8254.c            |   6 +-
 arch/x86/kvm/i8259.c            |   3 +-
 arch/x86/kvm/lapic.c            |   3 +-
 arch/x86/kvm/mmu.c              |   2 +
 arch/x86/kvm/x86.c              |  51 +--
 include/linux/kvm.h             |  28 ++
 include/linux/kvm_host.h        |  66 +++-
 virt/kvm/assigned-dev.c         |  44 ++
 virt/kvm/coalesced_mmio.c       |   3 +-
 virt/kvm/eventfd.c              |   2 +-
 virt/kvm/ioapic.c               |   2 +-
 virt/kvm/iodev.h                |  31 +-
 virt/kvm/kvm_main.c             |  40 +-
 virt/kvm/msix_mmio.c            | 301 +++++++++
 virt/kvm/msix_mmio.h            |  25 +
 18 files changed, 616 insertions(+), 52 deletions(-)
 create mode 100644 virt/kvm/msix_mmio.c
 create mode 100644 virt/kvm/msix_mmio.h
--
[PATCH 3/4] KVM: Emulate MSI-X table in kernel
Then we can support mask bit operation of assigned devices now. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/Makefile |2 +- arch/x86/kvm/mmu.c |2 + arch/x86/kvm/x86.c | 40 - include/linux/kvm.h | 28 include/linux/kvm_host.h| 36 + virt/kvm/assigned-dev.c | 44 ++ virt/kvm/kvm_main.c | 38 +- virt/kvm/msix_mmio.c| 301 +++ virt/kvm/msix_mmio.h| 25 10 files changed, 504 insertions(+), 13 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index aa75f21..4a390a4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -635,6 +635,7 @@ enum emulation_result { EMULATE_DONE, /* no further processing */ EMULATE_DO_MMIO, /* kvm_run filled with mmio request */ EMULATE_FAIL, /* can't emulate this instruction */ + EMULATE_USERSPACE_EXIT, /* we need exit to userspace */ }; #define EMULTYPE_NO_DECODE (1 0) diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API)+= $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9cafbb4..912dca4 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3358,6 +3358,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code, case EMULATE_DO_MMIO: ++vcpu-stat.mmio_exits; /* fall through */ + case EMULATE_USERSPACE_EXIT: + /* fall through */ case EMULATE_FAIL: return 0; default: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 21b84e2..87308eb 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3809,6 +3810,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, { gpa_t gpa; struct kvm_io_ext_data ext_data; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3824,18 +3826,32 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data); /* * Is this MMIO handled locally? 
*/ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data)) + if (!r) return X86EMUL_CONTINUE; - vcpu-mmio_needed = 1; - vcpu-run-exit_reason = KVM_EXIT_MMIO; - vcpu-run-mmio.phys_addr = vcpu-mmio_phys_addr = gpa; - vcpu-run-mmio.len = vcpu-mmio_size = bytes; - vcpu-run-mmio.is_write = vcpu-mmio_is_write = 1; - memcpy(vcpu-run-mmio.data, val, bytes); + if (r == -ENOTSYNC) { + vcpu-userspace_exit_needed = 1; + vcpu-run-exit_reason = KVM_EXIT_MSIX_ROUTING_UPDATE; + vcpu-run-msix_routing.dev_id = + ext_data.msix_routing.dev_id; + vcpu-run-msix_routing.type = + ext_data.msix_routing.type; + vcpu-run-msix_routing.entry_idx = + ext_data.msix_routing.entry_idx; + vcpu-run-msix_routing.flags = + ext_data.msix_routing.flags; + } else { + vcpu-mmio_needed = 1; + vcpu-run-exit_reason = KVM_EXIT_MMIO; + vcpu-run-mmio.phys_addr = vcpu-mmio_phys_addr = gpa; + vcpu-run-mmio.len = vcpu-mmio_size = bytes; + vcpu-run-mmio.is_write = vcpu-mmio_is_write = 1; + memcpy(vcpu-run-mmio.data, val, bytes); + } return X86EMUL_CONTINUE; } @@ -4469,6 +4485,8 @@ done: r = EMULATE_DO_MMIO; } else if (r == EMULATION_RESTART) goto restart; + else if (vcpu-userspace_exit_needed) + r = EMULATE_USERSPACE_EXIT; else r = EMULATE_DONE; @@ -5397,12 +5415,18 @@ int
[PATCH 4/4] KVM: Add documents for MSI-X MMIO API
Signed-off-by: Sheng Yang sh...@linux.intel.com --- Documentation/kvm/api.txt | 58 + 1 files changed, 58 insertions(+), 0 deletions(-) diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt index e1a9297..dd10c3b 100644 --- a/Documentation/kvm/api.txt +++ b/Documentation/kvm/api.txt @@ -1263,6 +1263,53 @@ struct kvm_assigned_msix_entry { __u16 padding[3]; }; +4.54 KVM_REGISTER_MSIX_MMIO + +Capability: KVM_CAP_MSIX_MMIO +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_msix_mmio_user (in) +Returns: 0 on success, -1 on error + +This API indicates an MSI-X MMIO address of a guest device. Then all MMIO +operation would be handled by kernel. When necessary(e.g. MSI data/address +changed), KVM would exit to userspace using KVM_EXIT_MSIX_ROUTING_UPDATE to +indicate the MMIO modification and require userspace to update IRQ routing +table. + +NOTICE: Writing the MSI-X MMIO page after it was registered with this API may +be dangerous for userspace program. The writing during VM running may result +in synchronization issue therefore the assigned device can't work properly. +The writing is allowed when VM is not running and can be used as save/restore +mechanism. + +struct kvm_msix_mmio_user { + __u32 dev_id; + __u16 type; /* Device type and MMIO address type */ + __u16 max_entries_nr; /* Maximum entries supported */ + __u64 base_addr;/* Guest physical address of MMIO */ + __u64 base_va; /* Host virtual address of MMIO mapping */ + __u64 flags;/* Reserved for now */ + __u64 reserved[4]; +}; + +Current device type can be: +#define KVM_MSIX_MMIO_TYPE_ASSIGNED_DEV(1 0) + +Current MMIO type can be: +#define KVM_MSIX_MMIO_TYPE_BASE_TABLE (1 8) + +4.55 KVM_UNREGISTER_MSIX_MMIO + +Capability: KVM_CAP_MSIX_MMIO +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_msix_mmio_user (in) +Returns: 0 on success, -1 on error + +This API would unregister the specific MSI-X MMIO, indicated by dev_id and +type fields of struct kvm_msix_mmio_user. + 5. 
The kvm_run structure

Application code obtains a pointer to the kvm_run structure by
@@ -1445,6 +1492,17 @@ Userspace can now handle the hypercall and when it's done modify the
 gprs as necessary. Upon guest entry all guest GPRs will then be replaced by the
 values in this struct.
 
+		/* KVM_EXIT_MSIX_ROUTING_UPDATE */
+		struct {
+			__u32 dev_id;
+			__u16 type;
+			__u16 entry_idx;
+			__u64 flags;
+		} msix_routing;
+
+KVM_EXIT_MSIX_ROUTING_UPDATE indicates that one MSI-X entry has been modified,
+and userspace needs to update the correlated routing table.
+
 	/* Fix the size of the union. */
 	char padding[256];
 };
--
1.7.0.1
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] KVM: Move struct kvm_io_device to kvm_host.h
Then it can be used by other structs in kvm_host.h.

Signed-off-by: Sheng Yang <sh...@linux.intel.com>
---
 include/linux/kvm_host.h |   23 +++++++++++++++++++++++
 virt/kvm/iodev.h         |   25 +------------------------
 2 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b5021db..7d313e0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -98,6 +98,29 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
+struct kvm_io_device;
+
+/**
+ * kvm_io_device_ops are called under kvm slots_lock.
+ * read and write handlers return 0 if the transaction has been handled,
+ * or non-zero to have it passed to the next device.
+ **/
+struct kvm_io_device_ops {
+	int (*read)(struct kvm_io_device *this,
+		    gpa_t addr,
+		    int len,
+		    void *val);
+	int (*write)(struct kvm_io_device *this,
+		     gpa_t addr,
+		     int len,
+		     const void *val);
+	void (*destructor)(struct kvm_io_device *this);
+};
+
+struct kvm_io_device {
+	const struct kvm_io_device_ops *ops;
+};
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/virt/kvm/iodev.h b/virt/kvm/iodev.h
index 12fd3ca..d1f5651 100644
--- a/virt/kvm/iodev.h
+++ b/virt/kvm/iodev.h
@@ -17,32 +17,9 @@
 #define __KVM_IODEV_H__
 
 #include <linux/kvm_types.h>
+#include <linux/kvm_host.h>
 #include <asm/errno.h>
 
-struct kvm_io_device;
-
-/**
- * kvm_io_device_ops are called under kvm slots_lock.
- * read and write handlers return 0 if the transaction has been handled,
- * or non-zero to have it passed to the next device.
- **/
-struct kvm_io_device_ops {
-	int (*read)(struct kvm_io_device *this,
-		    gpa_t addr,
-		    int len,
-		    void *val);
-	int (*write)(struct kvm_io_device *this,
-		     gpa_t addr,
-		     int len,
-		     const void *val);
-	void (*destructor)(struct kvm_io_device *this);
-};
-
-struct kvm_io_device {
-	const struct kvm_io_device_ops *ops;
-};
-
 static inline void kvm_iodevice_init(struct kvm_io_device *dev,
 				     const struct kvm_io_device_ops *ops)
 {
--
1.7.0.1
Re: [PATCH 0/3] [RFC] Implement multiqueue (RX & TX) virtio-net
On Mon, Feb 28, 2011 at 12:04:27PM +0530, Krishna Kumar wrote: This patch series is a continuation of an earlier one that implemented guest MQ TX functionality. This new patchset implements both RX and TX MQ. Qemu changes are not being included at this time, solely to aid in easier review. Compatibility testing with old/new combinations of qemu/guest and vhost was done without any issues. Some early TCP/UDP test results are at the bottom of this post; I plan to submit more test results in the coming days. Please review and provide feedback on what can be improved. Thanks! Signed-off-by: Krishna Kumar krkum...@in.ibm.com To help testing, could you post the qemu changes separately please? --- Test configuration: Host: 8 Intel Xeon CPUs, 8 GB memory. Guest: 4 cpus, 2 GB memory. Each test case runs for 60 secs; results below are averaged over two runs. Bandwidth numbers are in gbps. I have used default netperf settings, and no system tuning other than tasksetting each vhost to 0xf (cpus 0-3). Comparison is the original kernel vs the new kernel with #txqs=8 (# refers to number of netperf sessions).
___________________________________________________________________________
                  TCP: Guest -> Local Host (TCP_STREAM)
#     BW1     BW2 (%)          SD1     SD2 (%)         RSD1    RSD2 (%)
___________________________________________________________________________
1     7190    7170 (-.2)       0       0 (0)           3       4 (33.3)
2     8774    11235 (28.0)     3       3 (0)           16      14 (-12.5)
4     9753    15195 (55.7)     17      21 (23.5)       65      59 (-9.2)
8     10224   18265 (78.6)     71      115 (61.9)      251     240 (-4.3)
16    10749   18123 (68.6)     277     456 (64.6)      985     925 (-6.0)
32    11133   17351 (55.8)     1132    1947 (71.9)     3935    3831 (-2.6)
64    11223   17115 (52.4)     4682    7836 (67.3)     15949   15373 (-3.6)
128   11269   16986 (50.7)     19783   31505 (59.2)    66799   61759 (-7.5)
___________________________________________________________________________
Summary:      BW: 37.6         SD: 61.2                RSD: -6.5

___________________________________________________________________________
                  TCP: Local Host -> Guest (TCP_MAERTS)
#     BW1     BW2 (%)          SD1     SD2 (%)         RSD1    RSD2 (%)
___________________________________________________________________________
1     11490   10870 (-5.3)     0       0 (0)           2       2 (0)
2     10612   10554 (-.5)      2       3 (50.0)        12      12 (0)
4     10047   14320 (42.5)     13      16 (23.0)       53      53 (0)
8     9273    15182 (63.7)     56      84 (50.0)       228     233 (2.1)
16    9291    15853 (70.6)     235     390 (65.9)      934     965 (3.3)
32    9382    15741 (67.7)     969     1823 (88.1)     3868    4037 (4.3)
64    9270    14585 (57.3)     3966    8836 (122.7)    15415   17818 (15.5)
128   8997    14678 (63.1)     17024   36649 (115.2)   64933   72677 (11.9)
___________________________________________________________________________
Summary:      BW: 24.8         SD: 114.6               RSD: 12.1

___________________________________________________________
                  UDP: Local Host -> Guest (UDP_STREAM)
#     BW1     BW2 (%)          SD1     SD2 (%)
___________________________________________________________
1     17236   16585 (-3.7)     1       1 (0)
2     16795   22693 (35.1)     5       6 (20.0)
4     13390   21502 (60.5)     37      36 (-2.7)
8     13261   24361 (83.7)     163     175 (7.3)
16    12772   23796 (86.3)     692     826 (19.3)
32    12832   23880 (86.0)     2812    2871 (2.0)
64    12779   24293 (90.1)     11299   11237 (-.5)
128   13006   24857 (91.1)     44778   43884 (-1.9)
___________________________________________________________
Summary:      BW: 37.1         SD: -1.2
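Not part of the original mail: a rough sketch of the kind of driver script that could produce numbers in this shape, assuming netperf is installed and GUEST_IP points at the guest. The pct helper computes the (%) columns as (BW2 - BW1) * 100 / BW1; values may differ from the tables in the last digit due to rounding:

```shell
#!/bin/sh
# Percent change between a baseline and a test value, one decimal place.
pct() {
	awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) * 100.0 / a }'
}

# Hypothetical driver loop: run N parallel netperf sessions against the
# guest for 60 seconds each, as in the test description above.
run_sessions() {
	n=$1
	if ! command -v netperf >/dev/null 2>&1; then
		echo "netperf not installed; skipping $n-session run" >&2
		return
	fi
	i=0
	while [ "$i" -lt "$n" ]; do
		netperf -H "$GUEST_IP" -t TCP_STREAM -l 60 &	# GUEST_IP is a placeholder
		i=$((i + 1))
	done
	wait
}

pct 8774 11235	# prints 28.0 (second TCP_STREAM row above)
```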
Re: [Qemu-devel] [PATCH V6 3/4] qmp, nmi: convert do_inject_nmi() to QObject
Anthony Liguori <anth...@codemonkey.ws> writes: On 02/25/2011 03:54 AM, Markus Armbruster wrote: Anthony Liguori <aligu...@linux.vnet.ibm.com> writes: On 02/24/2011 10:20 AM, Markus Armbruster wrote: Anthony Liguori <aligu...@linux.vnet.ibm.com> writes: On 02/24/2011 02:33 AM, Markus Armbruster wrote: Anthony Liguori <anth...@codemonkey.ws> writes: [...] Please describe all expected errors. Quoting qmp-commands.hx: 3. Errors, in special, are not documented. Applications should NOT check for specific errors classes or data (it's strongly recommended to only check for the error key) Indeed, not a single error is documented there. This is intentional. Yeah, but we're not 0.14 anymore, and for 0.15 we need to document errors. If you are suggesting I send a patch to remove that section, I'm more than happy to. Two separate issues here: 1. Are we ready to commit to the current design of errors, and 2. Is it fair to reject Lai's patch now because he doesn't document his errors. I'm not commenting on 1. here. Regarding 2.: rejecting a patch because it doesn't document an aspect that current master intentionally leaves undocumented is not how you treat contributors. At least not if you want any other than certified masochists who enjoy pain, and professionals who get adequately compensated for it. Lead by example, not by fiat. http://repo.or.cz/w/qemu/aliguori.git/blob/refs/heads/glib:/qmp-schema.json I am in the process of documenting the errors of every command. It's a royal pain, but I'm going to document everything we have right now. It's actually the last bit of work I need to finish before sending QAPI out. So for new commands being added, it would be hugely helpful for the authors to document the errors so that I don't have to reverse engineer all of the possible error conditions. The moment this lands in master, you can begin to demand error descriptions from contributors. Until then, I'll NAK error descriptions in qmp-commands.txt.
We left them undocumented there for good reasons: No, it was always a bad reason. Good documentation is necessary to build good commands. Errors are a huge part of the semantics of a command. We cannot properly assess a command unless its behavior in error conditions is well defined. The decision to declare QMP stable was made at the KVM Forum after quite some deliberation. We explicitly excluded errors. We didn't do that because errors are not worthy of specification (that would be silly). We did it because there was too much unhappiness about the current state of errors to close the debate on them at that time. Luiz mindfully documented that decision in qmp-commands.txt: 3. Errors, in special, are not documented. Applications should NOT check for specific errors classes or data (it's strongly recommended to only check for the error key) Nobody denies that errors need to be specified, and this clause needs to go. But I must insist on proper review. No sneaking in of error specifications through the commit backdoor, please. Once we have an error design in place that has a reasonable hope of standing the test of time, and have errors documented for at least some of the commands here, we can start to require proper error documentation for new commands. But not now. I won't NAK non-normative error descriptions, say in commit messages or in comments. And I won't object to you asking for them. But I feel you really shouldn't make them a condition for committing patches. Especially not for simple patches that have been on the list for months. I'm strongly committed to making QMP a first-class interface in QEMU for 0.15. I feel documentation is a crucial part of making that happen. I'm not asking for test cases, even though that's something we'll need for 0.15, because there's not enough infrastructure in master yet to do that in a reasonable way. I realize I'm likely going to have to write that test case, and I'm happy to do that.
But there's no reason that we shouldn't require thorough documentation for all new QMP commands moving forward, including error semantics. This is a critical part of having a first-class API, and no additional infrastructure is needed in master to do this. Broken record time, again. Nobody denies that errors need to be specified. Fair goal for 0.15. But I must insist on proper review. No sneaking in of error specifications through the commit backdoor, please. I won't NAK non-normative error descriptions, say in commit messages or in comments. And I won't object to you asking for them. But I feel you really shouldn't make them a condition for committing patches. Especially not for simple patches that have been on the list for months.