[PATCH] KVM: Fix srcu struct leakage
From: Jan Kiszka jan.kis...@siemens.com Clean up the srcu struct and refactor its release on early errors. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- virt/kvm/kvm_main.c | 15 +++ 1 files changed, 7 insertions(+), 8 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4111a4b..c80a44a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -401,23 +401,19 @@ static struct kvm *kvm_create_vm(void) r = -ENOMEM; kvm-memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL); if (!kvm-memslots) - goto out_err; + goto out_err_nosrcu; if (init_srcu_struct(kvm-srcu)) - goto out_err; + goto out_err_nosrcu; for (i = 0; i KVM_NR_BUSES; i++) { kvm-buses[i] = kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL); - if (!kvm-buses[i]) { - cleanup_srcu_struct(kvm-srcu); + if (!kvm-buses[i]) goto out_err; - } } r = kvm_init_mmu_notifier(kvm); - if (r) { - cleanup_srcu_struct(kvm-srcu); + if (r) goto out_err; - } kvm-mm = current-mm; atomic_inc(kvm-mm-mm_count); @@ -435,6 +431,8 @@ out: return kvm; out_err: + cleanup_srcu_struct(kvm-srcu); +out_err_nosrcu: hardware_disable_all(); out_err_nodisable: for (i = 0; i KVM_NR_BUSES; i++) @@ -516,6 +514,7 @@ static void kvm_destroy_vm(struct kvm *kvm) kvm_arch_destroy_vm(kvm); hardware_disable_all(); mmdrop(mm); + cleanup_srcu_struct(kvm-srcu); } void kvm_get_kvm(struct kvm *kvm) -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] pci-assign: Catch missing KVM support
From: Jan Kiszka jan.kis...@siemens.com Report an error instead of raising a SEGV when a pci-assign device is about to be initialized without KVM enabled. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- hw/device-assignment.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index bde231d..5f5bde1 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1440,6 +1440,11 @@ static int assigned_initfn(struct PCIDevice *pci_dev) uint8_t e_device, e_intx; int r; +if (!kvm_enabled()) { +error_report(pci-assign: error: requires KVM support); +return -1; +} + if (!dev-host.seg !dev-host.bus !dev-host.dev !dev-host.func) { error_report(pci-assign: error: no host device specified); return -1; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Disk I/O stuck with KVM - no clue how to solve that
Am Samstag 06 November 2010 20:58:12 schrieb Stefan Hajnoczi: On Fri, Nov 5, 2010 at 5:16 PM, Hermann Himmelbauer du...@qwer.tk wrote: I experience strange disk I/O stucks on my Linux Host + Guest with KVM, which make the system (especially the guests) almost unusable. These stucks come periodically, e.g. every 2 to 10 seconds and last between 3 and sometimes over 120 seconds, which trigger kernel messages like this (on host and/or guest): INFO: task postgres:2195 blocked for more than 120 seconds The fact that this happens on the host too suggests there's an issue with the host software/hardware and the VM is triggering it but not the root cause. Does dmesg display any other suspicious messages? No, there's anything that can be seen via dmesg. I at first suspected the hardware, too. I can think of the following reasons: 1) Broken SATA cable / Harddisks - I changed some cables, no change, thus this is probably ruled out. I also can't see anything via S.M.A.R.T. Moreover, the problem is not bound to a specific device, instead it happens on sda - sdd, so I doubt it's harddisk related. 2) Broken Power Supply / Insufficient Power - I'd expect either a complete crash or some error messages in this case, so I'd rather rule that out. 3) Broken SATA-Controller - I cannot think of any way to check that, but I'd also expect some crashes or kernel messages. I flashed the board to the latest BIOS version, no change either. However, it seems no one except me seems to have this problem, so I'll buy a new, similar but different mainboard (Intel instead of Asus), hopefully this solves the problem. What do you think, any better idea? Anyway, thanks for your reply! Best Regards, Hermann -- herm...@qwer.tk GPG key ID: 299893C7 (on keyservers) FP: 0124 2584 8809 EF2A DBF9 4902 64B4 D16B 2998 93C7 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] KVM: Fix srcu struct leakage
From: Jan Kiszka jan.kis...@siemens.com Clean up the srcu struct on vm destruction and refactor its release on early errors. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- This one is better: kvm_arch_destroy_vm releases the kvm object, so we need to free the struct earlier. virt/kvm/kvm_main.c | 15 +++ 1 files changed, 7 insertions(+), 8 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4111a4b..6cfcde7 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -401,23 +401,19 @@ static struct kvm *kvm_create_vm(void) r = -ENOMEM; kvm-memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL); if (!kvm-memslots) - goto out_err; + goto out_err_nosrcu; if (init_srcu_struct(kvm-srcu)) - goto out_err; + goto out_err_nosrcu; for (i = 0; i KVM_NR_BUSES; i++) { kvm-buses[i] = kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL); - if (!kvm-buses[i]) { - cleanup_srcu_struct(kvm-srcu); + if (!kvm-buses[i]) goto out_err; - } } r = kvm_init_mmu_notifier(kvm); - if (r) { - cleanup_srcu_struct(kvm-srcu); + if (r) goto out_err; - } kvm-mm = current-mm; atomic_inc(kvm-mm-mm_count); @@ -435,6 +431,8 @@ out: return kvm; out_err: + cleanup_srcu_struct(kvm-srcu); +out_err_nosrcu: hardware_disable_all(); out_err_nodisable: for (i = 0; i KVM_NR_BUSES; i++) @@ -513,6 +511,7 @@ static void kvm_destroy_vm(struct kvm *kvm) #else kvm_arch_flush_shadow(kvm); #endif + cleanup_srcu_struct(kvm-srcu); kvm_arch_destroy_vm(kvm); hardware_disable_all(); mmdrop(mm); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND PATCH] virtio-net: init link state correctly
On Fri, 5 Nov 2010 08:17:18 pm Jason Wang wrote: For device that supports VIRTIO_NET_F_STATUS, there's no need to assume the link is up and we need to call nerif_carrier_off() before querying device status, otherwise we may get wrong operstate after diver was loaded because the link watch event was not fired as expected. For device that does not support VIRITO_NET_F_STATUS, we could not get its status through virtnet_update_status() and what we can only do is always assuming the link is up. Acked-by: Michael S. Tsirkin m...@redhat.com Signed-off-by: Jason Wang jasow...@redhat.com Acked-by: Rusty Russell ru...@rustcorp.com.au Thanks! Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel)
On Saturday 06 November 2010 08:27:15 Marcelo Tosatti wrote: On Thu, Nov 04, 2010 at 02:15:16PM +0800, Sheng Yang wrote: Here is the latest series of MSI-X mask supporting patches. The bigest change from last version is, in order to reduce the complexity, I moved all mask bit operation to kernel, including disabled entries. This addressed two concerns: 1. KVM and QEmu each own a part of mask bit operation. 2. QEmu need accessing the real hardware to get the mask bit information. So now QEmu have to use kernel API to get mask bit information. Though it would be slower than direct accessing the real hardware's MMIO, the direct accessing is unacceptable beacuse in fact the host OS own the device. The host OS can access the device without notifying the guest(and don't need to do so). Userspace shouldn't penetrate the host OS layer to directly operate the real hardware, otherwise would cause guest confusion. Also I discard the userspace mask operation as well, after showed the performance number. This version also removed the capability enabling mechanism. Because we want to use the struct kvm_assigned_msix_entry with new IOCTL, so there is no compatible issue. Please review. And I would run more test with current patch. So far so good. It would be good to know where the performance issue is with the entire implementation of mask bit in QEMU. http://www.mail-archive.com/kvm@vger.kernel.org/msg42652.html. All you mentioned in the past was there was high CPU utilization when running in QEMU and decided to start implementing in kernel. But AFAICS you did not really look into where the problem was... We've analysed the same issue in Xen, and believed it was caused by exiting to the QEmu everytime when guest want to access the mask bit(and some specific kernel want to do this desperately). So we provided the patches to Xen, and it worked very well. The cost of exiting to the userspace in KVM is much smaller than Xen side, but it still much slower than in-kernel. I've shown the performance data of kernel and userspace here: http://ns.spinics.net/lists/kvm/msg43712.html We can easily get 20% performance gain using kernel. -- regards Yang, Sheng -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Sun, 2010-11-07 at 00:24 +0800, Avi Kivity wrote: On 11/01/2010 03:22 PM, Anthony Liguori wrote: On 11/01/2010 02:20 PM, Huang Ying wrote: Yes. As general interface, it may not work so well, but as test interface, it works quite well and useful. Do we have any mechanism to add a test only interface? I'd like to see what Luiz/Markus think but definitely only a human monitor interface and probably prefix the command with a 'x-' prefix to indicate that it's not a supported interface. Automated testing will want to use qmp. Yes. The main usage of the interface is automated testing. The documentation should be very clear about the limitations of the interface and the intended use-case. Perhaps a { execute: 'enable-version-specific-commands', arguments: ['pfa2hva'] } ? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] KVM: MMU: don not retry #PF for nonpaging guest
On 11/05/2010 06:31 PM, Gleb Natapov wrote: On Fri, Nov 05, 2010 at 04:03:28PM +0800, Xiao Guangrong wrote: On 11/05/2010 03:45 PM, Gleb Natapov wrote: It looks like something broken: apfs can generated in L2 guest (nested ntp guest) and be retried in L1 guest. Why is this a problem? apf will be generate on direct map even when L2 guest is running so it should be OK to prefault it into direct map on completion. The nested_cr3 is different between L2 and L1, fix L2's page fault in L1's page table is useless. But we are fixing L0 page faults in L0 page table. We do not start apf because of L1 faulted in its page table. Hi Gleb, For example, NPT Guest L1 runs on Host, and Nested NPT Guest L2 runs on Guest L1. Now, Guest L2 is running, has below sequences: a: NPF/PF occurs in L2 Guest, and generates a apf(named A-apf), then L2 Guest is blocked b: a external event wakes up L2 Guest, and let it run again. c: L2 Guest VMEXIT to L1 Guest because L2 Guest's action is intercepted by Guest L1 d: When cpu enter L1 Guest, A-apf is completed, then it will retry A-apf in L1 Guest's mmu context, and this 'retry' is useless. Could you please point it out for me if i missed something. :-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ovs-dev] Flow Control and Port Mirroring
On Sat, 30 Oct 2010 01:29:33 pm Simon Horman wrote: [ CCed VHOST contacts ] On Thu, Oct 28, 2010 at 01:22:02PM -0700, Jesse Gross wrote: On Thu, Oct 28, 2010 at 4:54 AM, Simon Horman ho...@verge.net.au wrote: My reasoning is that in the non-mirroring case the guest is limited by the external interface through wich the packets eventually flow - that is 1Gbit/s. But in the mirrored either there is no flow control or the flow control is acting on the rate of dummy0, which is essentailly infinate. Before investigating this any further I wanted to ask if this behaviour is intentional. It's not intentional but I can take a guess at what is happening. When we send the packet to a mirror, the skb is cloned but only the original skb is charged to the sender. If the original packet is delivered to localhost then it will be freed quickly and no longer accounted for, despite the fact that the real packet is still sitting in the transmit queue on the NIC. The UDP stack will then send the next packet, limited only by the speed of the CPU. That would explain what I have observed. I can't find the thread (what is ovs-dev?), but I think the tap device has this fundamental feature: you can blast as many packets as you want through it. If that's a bad thing, we have to look harder... Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
On Friday 05 November 2010 21:27:42 Michael S. Tsirkin wrote: On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote: On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote: On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote: +}; + +This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X +mask bit in the kernel. What happens on repeated calls when it's already enabled? How does one disable after it's enabled? Suppose this should only be called once. But again would update the MMIO base. So maybe add this all to documentation. It would be disabled along with device deassignment. So what are you going to do when guest enables and later disables MSIX? Disable assignment completely? This device goes with PCI resources allocation, rather than IRQ assignment. I see. I guess we can live with it, but it seems tied to a specific mode of use. Would be better to support enabling together with msix too, this requires an ability to disable. We enable it explicitly because not all devices have MSI-X capability. + This is very specialized for assigned devices. I would like an interface not tied to assigned devices explicitly (even if the implementation at the moment only works for assigned devices, I can patch that in later). E.g. vhost-net would benefit from such an extension too. Maybe tie this to a special GSI and this GSI can be an assigned device or not? You're talking about KVM_ASSIGN_GET_MSIX_ENTRY? Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO and kvm_assigned_msix_mmio really. We can't tie it to GSI, because we should also cover entries without GSI. The entry number should be fine for all kinds of devices using MSI-X. I don't completely understand: entries without GSI never need to inject an interrupt. Why do we care about them? Let's pass such accesses to userspace. In any case, I think MSIX GSIs are inexpensive (we never search them all), so we could create GSIs for all entries if we wanted to. All mask bit accessing controlled by kernel, in order to keep consistent. Well I think passing whatever we can't handle to userspace makes complete sense. We do this for address/data writes after all. If we wanted to do it *all* in kernel we would handle address/data there too. I think Avi suggested this at some point. I think Avi's point was we handle mask bit half in the kernel and half in the userspace, which makes API complex. I agree with this so I moved all mask bit handling to kernel. Data/address is another issue, due to QEmu own the routing table. We can change it in the future, but this is not related to this patch. The main point is msix mmio BAR handling is completely unrelated to the assigned device. Emulated devices have an msix BAR as well. So tying it to an assigned devices is suboptimal, we'll need to grow yet another interface for these. If you only means the name, we can change that. But I think all MSI-X entries would have entry number, regardless of if it got GSI number. Keep entry number as parameter makes more sense to me. And still, the patch's only current user is assigned device. I am glad if it can cover the other future case along with it, but that's not the primary target of this patch. I am not sure about what PV one would looks like. I use kvm_assigned_msix_entry just because it can be easily reused. Not only PV, emulated devices too. OK let's try to figure out: What happens with userspace is, we inject interrupts either through ioctls or eventfds. Avoiding an exit to userspace on MSIX access is beneficial in exactly the same way. We could have an ioctl to pass in the table addresses, and mapping table to relevant GSIs. For example, add kvm_msix_mmio with table start/end, also specify which GSIs are covered. Maybe ask that userspace allocate all GSIs for a table consequitively? Or add kvm_irq_routing_msix which is same as msi but also has the mmio addresses? Bonus points for avoiding the need to scan all table on each write. For example, don't scan at all, instead just set a bit in KVM, and set irq entry on the next interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32 byte bitmap would be enough for this, avoding a linear scan. When we pass in kvm_msix_mmio, kvm would start registering masked state. +4.48 KVM_ASSIGN_GET_MSIX_ENTRY + +Capability: KVM_CAP_DEVICE_MSIX_MASK +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_assigned_msix_entry (in and out) +Returns: 0 on success, !0 on error + +struct kvm_assigned_msix_entry { + /* Assigned device's ID */ + __u32 assigned_dev_id; + /* Ignored */ +
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On 11/06/2010 11:24 AM, Avi Kivity wrote: On 11/01/2010 03:22 PM, Anthony Liguori wrote: On 11/01/2010 02:20 PM, Huang Ying wrote: Yes. As general interface, it may not work so well, but as test interface, it works quite well and useful. Do we have any mechanism to add a test only interface? I'd like to see what Luiz/Markus think but definitely only a human monitor interface and probably prefix the command with a 'x-' prefix to indicate that it's not a supported interface. Automated testing will want to use qmp. The command is inherently racy so if you tried to implement an automated unit test, you'd get false The documentation should be very clear about the limitations of the interface and the intended use-case. Perhaps a { execute: 'enable-version-specific-commands', arguments: ['pfa2hva'] } ? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On 11/07/2010 07:29 PM, Huang Ying wrote: On Sun, 2010-11-07 at 00:24 +0800, Avi Kivity wrote: On 11/01/2010 03:22 PM, Anthony Liguori wrote: On 11/01/2010 02:20 PM, Huang Ying wrote: Yes. As general interface, it may not work so well, but as test interface, it works quite well and useful. Do we have any mechanism to add a test only interface? I'd like to see what Luiz/Markus think but definitely only a human monitor interface and probably prefix the command with a 'x-' prefix to indicate that it's not a supported interface. Automated testing will want to use qmp. Yes. The main usage of the interface is automated testing. That's precisely what the command should not be used for. You can't assume a gpa - hva mapping is consistent in an external application. If you want to implement an interface for testing, you have to push more of the logic into QEMU to avoid the race. Regards, Anthony Liguori The documentation should be very clear about the limitations of the interface and the intended use-case. Perhaps a { execute: 'enable-version-specific-commands', arguments: ['pfa2hva'] } ? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ovs-dev] Flow Control and Port Mirroring
On Mon, Nov 08, 2010 at 01:41:23PM +1030, Rusty Russell wrote: On Sat, 30 Oct 2010 01:29:33 pm Simon Horman wrote: [ CCed VHOST contacts ] On Thu, Oct 28, 2010 at 01:22:02PM -0700, Jesse Gross wrote: On Thu, Oct 28, 2010 at 4:54 AM, Simon Horman ho...@verge.net.au wrote: My reasoning is that in the non-mirroring case the guest is limited by the external interface through wich the packets eventually flow - that is 1Gbit/s. But in the mirrored either there is no flow control or the flow control is acting on the rate of dummy0, which is essentailly infinate. Before investigating this any further I wanted to ask if this behaviour is intentional. It's not intentional but I can take a guess at what is happening. When we send the packet to a mirror, the skb is cloned but only the original skb is charged to the sender. If the original packet is delivered to localhost then it will be freed quickly and no longer accounted for, despite the fact that the real packet is still sitting in the transmit queue on the NIC. The UDP stack will then send the next packet, limited only by the speed of the CPU. That would explain what I have observed. I can't find the thread (what is ovs-dev?), Sorry, yes its on ovs-dev. http://openvswitch.org/pipermail/dev_openvswitch.org/2010-October/003806.html but I think the tap device has this fundamental feature: you can blast as many packets as you want through it. If that's a bad thing, we have to look harder... There does seem to be flow control in the non-mirrored case. So I suspect its occurring at the skb level but that breaks down when a clone occurs. It would seem that fragment level flow control would help this problem (which is basically what Xen's netback/netfront has), but by this point I am speculating wildly. I'll try and find out exactly where the problem is occurring in order for us to have a more informed discussion. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] qemu-kvm: device assignment: emulate MSI-X mask bits
On Friday 05 November 2010 21:54:16 Michael S. Tsirkin wrote: On Fri, Nov 05, 2010 at 07:02:02PM +0800, Sheng Yang wrote: On Friday 05 November 2010 17:05:32 Michael S. Tsirkin wrote: On Fri, Nov 05, 2010 at 11:20:37AM +0800, Sheng Yang wrote: On Thursday 04 November 2010 17:44:27 Michael S. Tsirkin wrote: Thanks very much for reviewing this! Seems nobody cares about userspace one before... On Thu, Nov 04, 2010 at 02:18:21PM +0800, Sheng Yang wrote: This patch emulated MSI-X per vector mask bit on assigned device. Signed-off-by: Sheng Yang sh...@linux.intel.com --- hw/device-assignment.c | 161 ++-- 1 files changed, 155 insertions(+), 6 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 8a98876..639aa0b 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -62,6 +62,11 @@ static void assigned_dev_load_option_rom(AssignedDevice *dev); static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev); +static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn) +{ +return (uint32_t)seg 16 | (uint32_t)bus 8 | (uint32_t)devfn; +} + static uint32_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region, uint32_t addr, int len, uint32_t *val) { @@ -264,6 +269,9 @@ static void assigned_dev_iomem_map(PCIDevice *pci_dev, int region_num, AssignedDevRegion *region = r_dev-v_addrs[region_num]; PCIRegion *real_region = r_dev-real_device.regions[region_num]; int ret = 0; +#ifdef KVM_CAP_DEVICE_MSIX_MASK +struct kvm_assigned_msix_mmio msix_mmio; +#endif DEBUG(e_phys=%08 FMT_PCIBUS r_virt=%p type=%d len=%08 FMT_PCIBUS region_num=%d \n, e_phys, region-u.r_virtbase, type, e_size, region_num); @@ -282,6 +290,16 @@ static void assigned_dev_iomem_map(PCIDevice *pci_dev, int region_num, cpu_register_physical_memory(e_phys + offset, TARGET_PAGE_SIZE, r_dev-mmio_index); +#ifdef KVM_CAP_DEVICE_MSIX_MASK + memset(msix_mmio, 0, sizeof(struct kvm_assigned_msix_mmio)); + msix_mmio.assigned_dev_id = calc_assigned_dev_id(r_dev-h_segnr, + r_dev- h_busnr, r_dev-h_devfn); + msix_mmio.base_addr = e_phys + offset; +/* We required kernel MSI-X support */ + ret = kvm_assign_reg_msix_mmio(kvm_context, msix_mmio); + if (ret) +fprintf(stderr, fail to register in-kernel msix_mmio!\n); +#endif } } @@ -824,11 +842,6 @@ static void free_assigned_device(AssignedDevice *dev) } } -static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn) -{ -return (uint32_t)seg 16 | (uint32_t)bus 8 | (uint32_t)devfn; -} - static void assign_failed_examine(AssignedDevice *dev) { char name[PATH_MAX], dir[PATH_MAX], driver[PATH_MAX] = {}, *ns; @@ -1123,6 +1136,27 @@ static int get_msix_entries_max_nr(AssignedDevice *adev) return entries_max_nr; } +#ifdef KVM_CAP_DEVICE_MSIX_MASK +static int assigned_dev_msix_entry_masked(AssignedDevice *adev, int entry) +{ +struct kvm_assigned_msix_entry msix_entry; +int r; + +memset(msix_entry, 0, sizeof msix_entry); +msix_entry.assigned_dev_id = calc_assigned_dev_id(adev-h_segnr, +adev-h_busnr, (uint8_t)adev-h_devfn); +msix_entry.entry = entry; +msix_entry.flags = KVM_MSIX_FLAG_QUERY_MASK; +r = kvm_assign_get_msix_entry(kvm_context, msix_entry); +if (r) { +fprintf(stderr, assigned_dev_msix_entry_masked: + Fail to get mask bit of entry %d\n, entry); +return 1; This error handling seems pretty useless. assert? Maybe we can continue with it. Assert seems a little strict. Well need to consider whether to return 1 or 0 then. Tell it it's masked then routing change may be continue. OK, then 1. we probably want to print this only once, not on each write 2. only try doing this if capability is present. +} +return (msix_entry.flags KVM_MSIX_FLAG_MASK); +} +#endif +
Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
On Saturday 06 November 2010 08:24:51 Marcelo Tosatti wrote: On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote: On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote: On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote: +}; + +This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X +mask bit in the kernel. What happens on repeated calls when it's already enabled? How does one disable after it's enabled? Suppose this should only be called once. But again would update the MMIO base. So maybe add this all to documentation. It would be disabled along with device deassignment. So what are you going to do when guest enables and later disables MSIX? Disable assignment completely? This device goes with PCI resources allocation, rather than IRQ assignment. What about hot plug, for example? I can't see how this can function without unregistering regions. The register time is when PCI resources allocation, and the unregister time is when device was deassigned. I suppose hot plug path would also notify kernel to deassign the device? -- regards Yang, Sheng -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] qemu-kvm: device assignment: emulate MSI-X mask bits
On Mon, Nov 08, 2010 at 01:17:47PM +0800, Sheng Yang wrote: @@ -1250,10 +1297,16 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, int enable_msix) fprintf(stderr, assigned_dev_update_msix: MSI-X entries_max_nr == 0); return; } +/* + * Guest may try to enable MSI-X before setting MSI-X entry done, so + * let's wait until guest unmask the entries. + */ Well it can also set up any number of entries, enable msix then set up more entries. Now what? It's the same. If it want to set up more entries, it have to unmask them. Then we would get them. Why can't we handle the first enable in the same way? Don't understand the question, but I guess the answer is pci_enable_msix(). pci_enable_msix is an internal kernel API to enable msix, isn't it? We are discussing qemu patch here. OK, I understand the question now. The others entries would be enabled with unmask, but not very first ones. Guest can unmask the entries first, then enable MSI-X. We should scan the table when MSI-X enabled, that's why the first one is different from others. I see. I think the comment mislead me: you really mean 'Guest can unmask MSI-X entries when MSI-X is disabled, don't do anything until MSI-X is enabled'. entries_max_nr); if (entries_nr == 0) { +#ifndef KVM_CAP_DEVICE_MSIX_MASK if (enable_msix) fprintf(stderr, MSI-X entry number is zero!\n); And what happens then? MSI-X can't work for old ones without MSIX mask support. Old ones? Old userspace without KVM_CAP_DEVICE_MSIX_MASK Reload the guest module may help. Guest might not have any concept of modules. Just an workaround. Not a suggestion method. Look, if there's some failure mode we need a way to recover, or to make it very unlikely. If this is guest doing something illegal let's add a comment explaining what it is and why it's illegal. If this is legal, why the print out? It's unsupported in the old QEmu. I think I have it. The comment you want is: /* Error: number of MSI-X entries is zero. Using this device requires KVM_CAP_DEVICE_MSIX_MASK support in kvm which is missing. */ As I said prebiously, we must check runtime capabilities for this, ifdef is not enough to know kernel supports a feature because there's no dependency between kernel and qemu packages. I can change the comment, but I do think it's good to let user know something important is wrong. stderr is really there for developers. You can't expect the user to understand this message. Maybe it's better to put it into DEBUG. -- regards Yang, Sheng -- regards Yang, Sheng -- regards Yang, Sheng -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Disk I/O stuck with KVM - no clue how to solve that
On Sun, Nov 7, 2010 at 4:07 PM, Hermann Himmelbauer du...@qwer.tk wrote: Am Samstag 06 November 2010 20:58:12 schrieb Stefan Hajnoczi: On Fri, Nov 5, 2010 at 5:16 PM, Hermann Himmelbauer du...@qwer.tk wrote: I experience strange disk I/O stucks on my Linux Host + Guest with KVM, which make the system (especially the guests) almost unusable. These stucks come periodically, e.g. every 2 to 10 seconds and last between 3 and sometimes over 120 seconds, which trigger kernel messages like this (on host and/or guest): INFO: task postgres:2195 blocked for more than 120 seconds The fact that this happens on the host too suggests there's an issue with the host software/hardware and the VM is triggering it but not the root cause. Does dmesg display any other suspicious messages? No, there's anything that can be seen via dmesg. I at first suspected the hardware, too. I can think of the following reasons: 1) Broken SATA cable / Harddisks - I changed some cables, no change, thus this is probably ruled out. I also can't see anything via S.M.A.R.T. Moreover, the problem is not bound to a specific device, instead it happens on sda - sdd, so I doubt it's harddisk related. 2) Broken Power Supply / Insufficient Power - I'd expect either a complete crash or some error messages in this case, so I'd rather rule that out. 3) Broken SATA-Controller - I cannot think of any way to check that, but I'd also expect some crashes or kernel messages. I flashed the board to the latest BIOS version, no change either. However, it seems no one except me seems to have this problem, so I'll buy a new, similar but different mainboard (Intel instead of Asus), hopefully this solves the problem. What do you think, any better idea? If you have the time, you can use perf probes to trace I/O requests in the host kernel. Perhaps completion interrupts are being dropped. You may wish to start by tracing requests issued and completed by the SATA driver. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
On Mon, Nov 08, 2010 at 11:18:34AM +0800, Sheng Yang wrote: On Friday 05 November 2010 21:27:42 Michael S. Tsirkin wrote: On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote: On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote: On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote: +}; + +This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X +mask bit in the kernel. What happens on repeated calls when it's already enabled? How does one disable after it's enabled? Suppose this should only be called once. But again would update the MMIO base. So maybe add this all to documentation. It would be disabled along with device deassignment. So what are you going to do when guest enables and later disables MSIX? Disable assignment completely? This device goes with PCI resources allocation, rather than IRQ assignment. I see. I guess we can live with it, but it seems tied to a specific mode of use. Would be better to support enabling together with msix too, this requires an ability to disable. We enable it explicitly because not all devices have MSI-X capability. + This is very specialized for assigned devices. I would like an interface not tied to assigned devices explicitly (even if the implementation at the moment only works for assigned devices, I can patch that in later). E.g. vhost-net would benefit from such an extension too. Maybe tie this to a special GSI and this GSI can be an assigned device or not? You're talking about KVM_ASSIGN_GET_MSIX_ENTRY? Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO and kvm_assigned_msix_mmio really. We can't tie it to GSI, because we should also cover entries without GSI. The entry number should be fine for all kinds of devices using MSI-X. I don't completely understand: entries without GSI never need to inject an interrupt. Why do we care about them? Let's pass such accesses to userspace. In any case, I think MSIX GSIs are inexpensive (we never search them all), so we could create GSIs for all entries if we wanted to. All mask bit accessing controlled by kernel, in order to keep consistent. Well I think passing whatever we can't handle to userspace makes complete sense. We do this for address/data writes after all. If we wanted to do it *all* in kernel we would handle address/data there too. I think Avi suggested this at some point. I think Avi's point was we handle mask bit half in the kernel and half in the userspace, which makes API complex. I agree with this so I moved all mask bit handling to kernel. Data/address is another issue, due to QEmu own the routing table. We can change it in the future, but this is not related to this patch. The main point is msix mmio BAR handling is completely unrelated to the assigned device. Emulated devices have an msix BAR as well. So tying it to an assigned devices is suboptimal, we'll need to grow yet another interface for these. If you only means the name, we can change that. I mean it shouldn't use the assigned device id. Use some other index: I suggested adding a new GSI type for this, but can be something else if you prefer. But I think all MSI-X entries would have entry number, regardless of if it got GSI number. Keep entry number as parameter makes more sense to me. And still, the patch's only current user is assigned device. I am glad if it can cover the other future case along with it, but that's not the primary target of this patch. It's a problem I think. Stop thinking about the patch for a moment and think about kvm as a whole. We know we will want such an interface for emulated and PV devices. What then? Add another interface and a bunch of code to support it? Point being, since userspace interfaces need to be supported forever, they should be generic if possible. I am not sure about what PV one would looks like. I use kvm_assigned_msix_entry just because it can be easily reused. Not only PV, emulated devices too. OK let's try to figure out: What happens with userspace is, we inject interrupts either through ioctls or eventfds. Avoiding an exit to userspace on MSIX access is beneficial in exactly the same way. We could have an ioctl to pass in the table addresses, and mapping table to relevant GSIs. For example, add kvm_msix_mmio with table start/end, also specify which GSIs are covered. Maybe ask that userspace allocate all GSIs for a table consequitively? Or add kvm_irq_routing_msix which is same as msi but also has the mmio addresses? Bonus points for avoiding the need to scan all table on each write. For example, don't scan at all, instead just
Re: [Qemu-devel] [PATCH] virtio-9p: fix build on !CONFIG_UTIMENSAT v2
This patch introduce a fallback mechanism for old systems that do not support utimensat. This will fix build failure with following warnings: hw/virtio-9p-local.c: In function 'local_utimensat': hw/virtio-9p-local.c:479: warning: implicit declaration of function 'utimensat' hw/virtio-9p-local.c:479: warning: nested extern declaration of 'utimensat' and hw/virtio-9p.c: In function 'v9fs_setattr_post_chmod': hw/virtio-9p.c:1410: error: 'UTIME_NOW' undeclared (first use in this function) hw/virtio-9p.c:1410: error: (Each undeclared identifier is reported only once hw/virtio-9p.c:1410: error: for each function it appears in.) hw/virtio-9p.c:1413: error: 'UTIME_OMIT' undeclared (first use in this function) hw/virtio-9p.c: In function 'v9fs_wstat_post_chmod': hw/virtio-9p.c:2905: error: 'UTIME_OMIT' undeclared (first use in this function) Signed-off-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com --- hw/virtio-9p-local.c | 32 ++-- hw/virtio-9p.h | 10 ++ 2 files changed, 40 insertions(+), 2 deletions(-) diff --git a/hw/virtio-9p-local.c b/hw/virtio-9p-local.c index 0d52020..7811d2c 100644 --- a/hw/virtio-9p-local.c +++ b/hw/virtio-9p-local.c @@ -479,10 +479,38 @@ static int local_chown(FsContext *fs_ctx, const char *path, FsCred *credp) return -1; } +/* TODO: relocate this to proper file, and make it more generic */ +static int qemu_utimensat(int dirfd, const char *path, + const struct timespec *times, int flags) +{ IMHO, this code can be moved to cutils.c +#ifdef CONFIG_UTIMENSAT +return utimensat(dirfd, path, times, flags); +#else +/* + * Fallback: use utimes() instead of utimensat(). + * See commit 74bc02b2d2272dc88fb98d43e631eb154717f517 for known problem. + */ +struct timeval tv[2]; +int i; + +for (i = 0; i 2; i++) { +if (times[i].tv_nsec == UTIME_OMIT || times[i].tv_nsec == UTIME_NOW) { +tv[i].tv_sec = 0; +tv[i].tv_usec = 0; +} else { +tv[i].tv_sec = times[i].tv_sec; +tv[i].tv_usec = times[i].tv_nsec / 1000; +} +} + +return utimes(path, tv[0]); +#endif The idea of introducing utimensat was to avoid resetting atime to 1970-01-01 05:30:00 (utime does not give option to not change atime). But as per utimes man page, if any of the time field is 0, it would be set to current time. As per stat man page, truncate will not update atime, only mtime will be updated. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re:[PATCH v14 06/17] Use callback to deal with skb_release_data() specially.
From: Xin Xiaohui xiaohui@intel.com Hmm, I suggest you read the comment two lines above. If destructor_arg is now cleared each time we allocate a new skb, then, please move it before dataref in shinfo structure, so that the following memset() does the job efficiently... Something like : diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index e6ba898..2dca504 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -195,6 +195,9 @@ struct skb_shared_info { __be32 ip6_frag_id; __u8tx_flags; struct sk_buff *frag_list; + /* Intermediate layers must ensure that destructor_arg + * remains valid until skb destructor */ + void*destructor_arg; struct skb_shared_hwtstamps hwtstamps; /* @@ -202,9 +205,6 @@ struct skb_shared_info { */ atomic_tdataref; - /* Intermediate layers must ensure that destructor_arg - * remains valid until skb destructor */ - void * destructor_arg; /* must be last field, see pskb_expand_head() */ skb_frag_t frags[MAX_SKB_FRAGS]; }; Will that affect the cache line? Or, we can move the line to clear destructor_arg to the end of __alloc_skb(). It looks like as the following, which one do you prefer? Thanks Xiaohui --- net/core/skbuff.c |8 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c83b421..df852f2 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -224,6 +224,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, child-fclone = SKB_FCLONE_UNAVAILABLE; } + shinfo-destructor_arg = NULL; out: return skb; nodata: @@ -343,6 +344,13 @@ static void skb_release_data(struct sk_buff *skb) if (skb_has_frags(skb)) skb_drop_fraglist(skb); + if (skb-dev dev_is_mpassthru(skb-dev)) { + struct skb_ext_page *ext_page = + skb_shinfo(skb)-destructor_arg; + if (ext_page ext_page-dtor) + ext_page-dtor(ext_page); + } + kfree(skb-head); } } -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver.
I have addressed this issue in v14 patch set. Thanks Xiaohui -Original Message- From: David Miller [mailto:da...@davemloft.net] Sent: Saturday, October 30, 2010 4:29 AM To: Xin, Xiaohui Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; jd...@linux.intel.com Subject: Re: [PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver. From: Xin, Xiaohui xiaohui@intel.com Date: Wed, 27 Oct 2010 09:33:12 +0800 Somehow, it seems not a trivial work to support it now. Can we support it later and as a todo with our current work? I would prefer the feature work properly, rather than only in specific cases, before being integated. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html