[PATCH 2/2] Intel-IOMMU, intr-remap: source-id checking
To support domain-isolation usages, the platform hardware must be capable of uniquely identifying the requestor (source-id) for each interrupt message. Without source-id checking for interrupt remapping , a rouge guest/VM with assigned devices can launch interrupt attacks to bring down anothe guest/VM or the VMM itself. This patch adds source-id checking for interrupt remapping, and then really isolates interrupts for guests/VMs with assigned devices. Signed-off-by: Weidong Han weidong@intel.com --- arch/x86/kernel/apic/io_apic.c |6 +++ drivers/pci/intr_remapping.c | 98 drivers/pci/intr_remapping.h |2 + include/linux/dmar.h | 11 + 4 files changed, 117 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index 30da617..3d10c68 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -1559,6 +1559,9 @@ int setup_ioapic_entry(int apic_id, int irq, irte.vector = vector; irte.dest_id = IRTE_DEST(destination); + /* Set source-id of interrupt request */ + set_ioapic_sid(irte, apic_id); + modify_irte(irq, irte); ir_entry-index2 = (index 15) 0x1; @@ -3329,6 +3332,9 @@ static int msi_compose_msg(struct pci_dev *pdev, unsigned int irq, struct msi_ms irte.vector = cfg-vector; irte.dest_id = IRTE_DEST(dest); + /* Set source-id of interrupt request */ + set_msi_sid(irte, pdev); + modify_irte(irq, irte); msg-address_hi = MSI_ADDR_BASE_HI; diff --git a/drivers/pci/intr_remapping.c b/drivers/pci/intr_remapping.c index 946e170..825bca2 100644 --- a/drivers/pci/intr_remapping.c +++ b/drivers/pci/intr_remapping.c @@ -10,6 +10,7 @@ #include linux/intel-iommu.h #include intr_remapping.h #include acpi/acpi.h +#include pci.h static struct ioapic_scope ir_ioapic[MAX_IO_APICS]; static int ir_ioapic_num; @@ -405,6 +406,61 @@ int free_irte(int irq) return rc; } +int set_ioapic_sid(struct irte *irte, int apic) +{ + int i; + u16 sid = 0; + + if (!irte) + return -1; + + for (i = 0; i MAX_IO_APICS; i++) + if (ir_ioapic[i].id == apic) { + sid = (ir_ioapic[i].bus 8) | ir_ioapic[i].devfn; + break; + } + + if (sid == 0) { + printk(KERN_WARNING Failed to set source-id of + I/O APIC (%d), because it is not under + any DRHD\n, apic); + return -1; + } + + irte-svt = 1; /* requestor ID verification SID/SQ */ + irte-sq = 0; /* comparing all 16-bit of SID */ + irte-sid = sid; + + return 0; +} + +int set_msi_sid(struct irte *irte, struct pci_dev *dev) +{ + struct pci_dev *tmp; + + if (!irte || !dev) + return -1; + + tmp = pci_find_upstream_pcie_bridge(dev); + if (!tmp) { /* PCIE device or integrated PCI device */ + irte-svt = 1; /* verify requestor ID verification SID/SQ */ + irte-sq = 0; /* comparing all 16-bit of SID */ + irte-sid = (dev-bus-number 8) | dev-devfn; + return 0; + } + + if (tmp-is_pcie) { /* this is a PCIE-to-PCI bridge */ + irte-svt = 2; /* verify request ID verification SID */ + irte-sid = (tmp-bus-number 8) | dev-bus-number; + } else { /* this is a legacy PCI bridge */ + irte-svt = 1; /* verify requestor ID verification SID/SQ */ + irte-sq = 0; /* comparing all 16-bit of SID */ + irte-sid = (tmp-bus-number 8) | tmp-devfn; + } + + return 0; +} + static void iommu_set_intr_remapping(struct intel_iommu *iommu, int mode) { u64 addr; @@ -616,6 +672,10 @@ static int ir_parse_ioapic_scope(struct acpi_dmar_header *header, struct acpi_dmar_hardware_unit *drhd; struct acpi_dmar_device_scope *scope; void *start, *end; + struct acpi_dmar_pci_path *path; + struct pci_bus *bus; + struct pci_dev *pdev = NULL; + int count; drhd = (struct acpi_dmar_hardware_unit *)header; @@ -634,6 +694,44 @@ static int ir_parse_ioapic_scope(struct acpi_dmar_header *header, 0x%Lx\n, scope-enumeration_id, drhd-address); + bus = pci_find_bus(drhd-segment, scope-bus); + path = (struct acpi_dmar_pci_path *)(scope + 1); + count = (scope-length - +sizeof(struct acpi_dmar_device_scope)) + / sizeof(struct acpi_dmar_pci_path); + + while (count) { + if (pdev) +
[PATCH 1/2] Intel-IOMMU, intr-remap: set the whole 128bits of irte when modify/free it
Interrupt remapping table entry is 128bits. Currently, it only sets low 64bits of irte in modify_irte and free_irte. This ignores high 64bits setting of irte, that means source-id setting will be ignored. This patch sets the whole 128bits of irte when modify/free it. Following source-id checking patch depends on this. Signed-off-by: Weidong Han weidong@intel.com --- drivers/pci/intr_remapping.c | 10 +++--- 1 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/pci/intr_remapping.c b/drivers/pci/intr_remapping.c index f5e0ea7..946e170 100644 --- a/drivers/pci/intr_remapping.c +++ b/drivers/pci/intr_remapping.c @@ -309,7 +309,8 @@ int modify_irte(int irq, struct irte *irte_modified) index = irq_iommu-irte_index + irq_iommu-sub_handle; irte = iommu-ir_table-base[index]; - set_64bit((unsigned long *)irte, irte_modified-low); + set_64bit((unsigned long *)irte-low, irte_modified-low); + set_64bit((unsigned long *)irte-high, irte_modified-high); __iommu_flush_cache(iommu, irte, sizeof(*irte)); rc = qi_flush_iec(iommu, index, 0); @@ -386,8 +387,11 @@ int free_irte(int irq) irte = iommu-ir_table-base[index]; if (!irq_iommu-sub_handle) { - for (i = 0; i (1 irq_iommu-irte_mask); i++) - set_64bit((unsigned long *)(irte + i), 0); + for (i = 0; i (1 irq_iommu-irte_mask); i++) { + set_64bit((unsigned long *)irte-low, 0); + set_64bit((unsigned long *)irte-high, 0); + irte++; + } rc = qi_flush_iec(iommu, index, irq_iommu-irte_mask); } -- 1.6.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: tg3 driver in guest fails for VT-d passthrough NIC
Yuji virtualized more capabilities to guest in Xen Qemu. I think your tg3 driver can work in Xen guest with VT-d passthrough. If anyone interests, he/she can port the code from Xen Qemu to KVM Qemu. Regards, Weidong Alex Williamson wrote: On Wed, May 6, 2009 at 11:31 AM, Nadolski, Ed ed.nadol...@lsi.com wrote: Hi, I am running the KVM kernel userspace downloaded on 4/24 with Fedora 10 host Linux on a Dell T7400 Xeon with VT-x and VT-d enabled. The T7400 has an onboard Broadcom NIC, but when I use VT-d to assign this NIC to a Fedora 10 guest, the NIC's tg3 driver in the guest aborts because it cannot find the PM Capability in the device's PCI config space. ... So has anyone else seen this, and what is the right way to address it? It's not good to simply pass thru certain device Capabilities if they cannot be properly handled by KVM/QEMU, but OTOH the unmodified guest OS drivers should not break because they were written to expect certain Capabilities. Is there any plan to add support for these missing Capabilities? Yes, I've seen this with bnx2 devices. For a Linux guest, it's a fairly trivial change to the driver to not depend on these capabilities, but it would be far more compatible with existing driver code out there to make an attempt to support support them in KVM. AFAIK, it's not being working on yet. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: Question about KVM and PC speaker
Sebastian Herbszt wrote: Jan Kiszka wrote: Sebastian Herbszt wrote: Simon Bienlein wrote: Is a support for BIOS worked on right now? The vgabios (vgabios.c) has a FIXME should beep. Volker, do you plan to fix this? Which frequency should be used for the beep? Which delay? I would try 1 KHz and some hundred milliseconds. I just looked at some vga bios and it uses about 896,45 Hz. Getting a delay using inb(0x61) 0x10 is still a no go on qemu, right? Looks like (should be far too inaccurate for longer delays). What about 0x40:0x6c, the BIOS' daily timer counter? The bios i looked at used the refresh request port 0x61. This is supported by bochs and there is also a patch for qemu to replace the dummy [1]. The rombios uses this to provide INT 15h AH=86h functionality; this is likely broken with the dummy code in qemu. I see no problem with improving qemu's emulation accuracy this way a bit. But I wouldn't built new delay implementations on top of it, specifically if the code is in fact aware of running over a hypervisor. Such micro-timings are far too inaccurate for longer delays in an environment where you cannot be sure of running all the time during that period. Anyway, using timer ticks since midnight should be possible (INT 08h handler is set up before vga bios is called). [1] http://article.gmane.org/gmane.comp.emulators.bochs.devel/7594 - Sebastian Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Zhang, Xiantao wrote: --- qemu-kvm.orig/vl.c +++ qemu-kvm/vl.c @@ -4417,13 +4417,11 @@ } if (cpu_can_run(env)) ret = qemu_cpu_exec(env); -#ifndef CONFIG_GDBSTUB ^^^ Don't know why change #ifdef to #ifndef in upstream, and I remember it should be ifdef before. I believe this stuff should be compiled only if CONFIG_GDBSTUB is defined. Yes, I made get_set_top_cpu() a noop instead when compiled with CONFIG_GDBSTUB disabled, that way we avoid a bunch of these silly #ifdefs. Cheers, Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] msi-x: let drivers retry when not enough vectors
pci_enable_msix currently returns -EINVAL if you ask for more vectors than supported by the device, which would typically cause fallback to regular interrupts. It's better to return the table size, making the driver retry MSI-X with less vectors. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Hi Jesse, This came up when I was adding MSI-X support to virtio pci driver, which does not know the exact table size upfront. Could you consider this patch for 2.6.31 please? drivers/pci/msi.c |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 6f2e629..f5bd1c9 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -687,8 +687,8 @@ int pci_msix_table_size(struct pci_dev *dev) * indicates the successful configuration of MSI-X capability structure * with new allocated MSI-X irqs. A return of 0 indicates a failure. * Or a return of 0 indicates that driver request is exceeding the number - * of irqs available. Driver should use the returned value to re-send - * its request. + * of irqs or MSI-X vectors available. Driver should use the returned value to + * re-send its request. **/ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec) { @@ -704,7 +704,7 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec) nr_entries = pci_msix_table_size(dev); if (nvec nr_entries) - return -EINVAL; + return nr_entries; /* Check for any invalid entries */ for (i = 0; i nvec; i++) { -- 1.6.3.rc3.1.g830204 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[ kvm-Bugs-2506814 ] TAP network lockup after some traffic
Bugs item #2506814, was opened at 2009-01-14 11:38 Message generated for change (Comment added) made by mellen You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2506814group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Fabio Coatti (cova) Assigned to: Nobody/Anonymous (nobody) Summary: TAP network lockup after some traffic Initial Comment: Hi all, we are experiencing severe network troubles using kvm+tap networking. basically, after some network load (we have yet to identify the exact amount of traffic if one exist) network stops working. During lockups, With tcpdump we can see arp requests on guest interface, then on tap, brX and physical interfaces on host system. the arp answers can be seen, with tcpdump, only on physical host interface and bridge (brX), but not on tap device. Basically it seems that packets coming from external network get lost in tap device on the way to guest (kvm). Looking at tap data with ifconfig, the only weird thing is the TX packets overrun count that is 0. every time the network stops working, overrun count increases. This has been observed with several kvm releases (for sure, 76/77/78/79/80/81/82) and with different kernels (tried with some versions among 2.6.25.X, 26.X, 27.X, 28) both on guest and host side. we tried several network drivers (virtio, e1000, rtl) and all shows the same problem. Only 100Mbit drivers seems to be unaffected so far. (only virtio has acceptable performance) (btw: on host machine we have vlan on top of ethX devices) cpu number on guest makes no difference. we tried with vanilla kernel provided kvm modules and with kvm package provided modules. guest: 32 bit host: 64 bit host machine: 2 x Quad-Core AMD Opteron(tm) Processor 2352 16GB, gentoo. Of course I can provide more details or perform other tests and try patches, if someone can give me some hints and advices they will be most welcome. Thanks. -- Comment By: Tais M. Hansen (mellen) Date: 2009-05-07 10:39 Message: Just happened again. Seems like it stops generating interrupts on virtio devices: 10: 3001 51199 IO-APIC-fasteoi virtio1, virtio2 11:76948358468134 IO-APIC-fasteoi virtio0 doing an /bin/ls froze the guest for a minute or so until a CTRL-C got through. After /bin/ls: 10: 3001 51220 IO-APIC-fasteoi virtio1, virtio2 11:76948358468134 IO-APIC-fasteoi virtio0 kvm_stat: efer_reload0 0 exits 19150288617 1082 fpu_reload 11976730134 halt_exits 544721221 194 halt_wakeup300187634 141 host_state_reload 837279034 259 hypercalls4991133618 0 insn_emulation4797940911 709 insn_emulation_fail0 0 invlpg 0 0 io_exits 28125446265 irq_exits 173205458 0 irq_window 0 0 largepages 0 0 mmio_exits720457 0 mmu_cache_miss 290275800 0 mmu_flooded289036380 0 mmu_pde_zapped 170308104 0 mmu_pte_updated 6239297430 0 mmu_pte_write 20470585377 0 mmu_recycled 0 0 mmu_shadow_zapped 335324686 0 pf_fixed 8324827214 0 pf_guest 2229555421 0 remote_tlb_flush 283772247 0 request_irq0 0 signal_exits 6 0 tlb_flush 533453055666 ... but there are 6 other guests on this host running just fine. Can't connect gdb to the kvm's gdbserver. It just says Remote 'g' packet reply is too long: Issuing system_reset stalled for a minute, then rebooted the guest. After reboot, guest is find again. -- Comment By: Tais M. Hansen (mellen) Date: 2009-05-06 17:07 Message: I'm curious about the status of this issue? I'm experiencing the same problem apparently randomly on guests. Restarting the network interface does not seem to fix the problem. Only a reboot (or system_reset in kvm/qemu console) solves it. Last time it happened was on a host with kvm-84 and guest with kernel 2.6.27 using virtio-net. Leading up to the stall it had a traffic load of about 5 mbit constantly up and down for just over 2 hours with one 12 mbit spikes. I did not check interface counters or network traces at the time. -- Comment By: Fabio Coatti (cova)
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thursday 07 May 2009 16:28:41 Michael S. Tsirkin wrote: pci_enable_msix currently returns -EINVAL if you ask for more vectors than supported by the device, which would typically cause fallback to regular interrupts. It's better to return the table size, making the driver retry MSI-X with less vectors. Hi Michael I think driver should read from capability list to know how many vector supported by this device before enable MSI-X for device, as pci_msix_table_size() did... -- regards Yang, Sheng Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Hi Jesse, This came up when I was adding MSI-X support to virtio pci driver, which does not know the exact table size upfront. Could you consider this patch for 2.6.31 please? drivers/pci/msi.c |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 6f2e629..f5bd1c9 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -687,8 +687,8 @@ int pci_msix_table_size(struct pci_dev *dev) * indicates the successful configuration of MSI-X capability structure * with new allocated MSI-X irqs. A return of 0 indicates a failure. * Or a return of 0 indicates that driver request is exceeding the number - * of irqs available. Driver should use the returned value to re-send - * its request. + * of irqs or MSI-X vectors available. Driver should use the returned value to + * re-send its request. **/ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec) { @@ -704,7 +704,7 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec) nr_entries = pci_msix_table_size(dev); if (nvec nr_entries) - return -EINVAL; + return nr_entries; /* Check for any invalid entries */ for (i = 0; i nvec; i++) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[ kvm-Bugs-2506814 ] TAP network lockup after some traffic
Bugs item #2506814, was opened at 2009-01-14 11:38 Message generated for change (Comment added) made by cova You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2506814group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Fabio Coatti (cova) Assigned to: Nobody/Anonymous (nobody) Summary: TAP network lockup after some traffic Initial Comment: Hi all, we are experiencing severe network troubles using kvm+tap networking. basically, after some network load (we have yet to identify the exact amount of traffic if one exist) network stops working. During lockups, With tcpdump we can see arp requests on guest interface, then on tap, brX and physical interfaces on host system. the arp answers can be seen, with tcpdump, only on physical host interface and bridge (brX), but not on tap device. Basically it seems that packets coming from external network get lost in tap device on the way to guest (kvm). Looking at tap data with ifconfig, the only weird thing is the TX packets overrun count that is 0. every time the network stops working, overrun count increases. This has been observed with several kvm releases (for sure, 76/77/78/79/80/81/82) and with different kernels (tried with some versions among 2.6.25.X, 26.X, 27.X, 28) both on guest and host side. we tried several network drivers (virtio, e1000, rtl) and all shows the same problem. Only 100Mbit drivers seems to be unaffected so far. (only virtio has acceptable performance) (btw: on host machine we have vlan on top of ethX devices) cpu number on guest makes no difference. we tried with vanilla kernel provided kvm modules and with kvm package provided modules. guest: 32 bit host: 64 bit host machine: 2 x Quad-Core AMD Opteron(tm) Processor 2352 16GB, gentoo. Of course I can provide more details or perform other tests and try patches, if someone can give me some hints and advices they will be most welcome. Thanks. -- Comment By: Fabio Coatti (cova) Date: 2009-05-07 10:51 Message: We are still biten by this issue. I'm running out of ideas, but if someone can give me some hints on how to track down the problem or at least collect more information I'll try it. Thanks. -- Comment By: Tais M. Hansen (mellen) Date: 2009-05-07 10:39 Message: Just happened again. Seems like it stops generating interrupts on virtio devices: 10: 3001 51199 IO-APIC-fasteoi virtio1, virtio2 11:76948358468134 IO-APIC-fasteoi virtio0 doing an /bin/ls froze the guest for a minute or so until a CTRL-C got through. After /bin/ls: 10: 3001 51220 IO-APIC-fasteoi virtio1, virtio2 11:76948358468134 IO-APIC-fasteoi virtio0 kvm_stat: efer_reload0 0 exits 19150288617 1082 fpu_reload 11976730134 halt_exits 544721221 194 halt_wakeup300187634 141 host_state_reload 837279034 259 hypercalls4991133618 0 insn_emulation4797940911 709 insn_emulation_fail0 0 invlpg 0 0 io_exits 28125446265 irq_exits 173205458 0 irq_window 0 0 largepages 0 0 mmio_exits720457 0 mmu_cache_miss 290275800 0 mmu_flooded289036380 0 mmu_pde_zapped 170308104 0 mmu_pte_updated 6239297430 0 mmu_pte_write 20470585377 0 mmu_recycled 0 0 mmu_shadow_zapped 335324686 0 pf_fixed 8324827214 0 pf_guest 2229555421 0 remote_tlb_flush 283772247 0 request_irq0 0 signal_exits 6 0 tlb_flush 533453055666 ... but there are 6 other guests on this host running just fine. Can't connect gdb to the kvm's gdbserver. It just says Remote 'g' packet reply is too long: Issuing system_reset stalled for a minute, then rebooted the guest. After reboot, guest is find again. -- Comment By: Tais M. Hansen (mellen) Date: 2009-05-06 17:07 Message: I'm curious about the status of this issue? I'm experiencing the same problem apparently randomly on guests. Restarting the network interface does not seem to fix the problem. Only a reboot (or system_reset in kvm/qemu console) solves it. Last time it happened was on a host with kvm-84 and guest with kernel
[patch] fix - do not build blobs when blobs are not needed
Hi, Don't like ugly assembler errors because QEMU tries to build some x86 assembly code which isn't needed on non-x86. Jes Do not try to build extboot when INSTALL_BLOBS is not set. Signed-off-by: Jes Sorensen j...@sgi.com --- Makefile |2 ++ configure |1 + 2 files changed, 3 insertions(+) Index: qemu-kvm/Makefile === --- qemu-kvm.orig/Makefile +++ qemu-kvm/Makefile @@ -419,6 +419,7 @@ .PHONY: kvm/extboot +ifdef INSTALL_BLOBS all: kvm/extboot kvm/extboot: @@ -427,3 +428,4 @@ || ! cmp -s pc-bios/extboot.bin $@/extboot.bin; then \ cp $@/extboot.bin pc-bios/extboot.bin; \ fi +endif Index: qemu-kvm/configure === --- qemu-kvm.orig/configure +++ qemu-kvm/configure @@ -356,6 +356,7 @@ cpu_emulation=no gdbstub=no slirp=no + blobs=no fi if [ $cpu = powerpc ]; then kvm=yes
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thursday 07 May 2009 17:05:06 Michael S. Tsirkin wrote: On Thu, May 07, 2009 at 04:51:24PM +0800, Sheng Yang wrote: On Thursday 07 May 2009 16:28:41 Michael S. Tsirkin wrote: pci_enable_msix currently returns -EINVAL if you ask for more vectors than supported by the device, which would typically cause fallback to regular interrupts. It's better to return the table size, making the driver retry MSI-X with less vectors. Hi Michael I think driver should read from capability list to know how many vector supported by this device before enable MSI-X for device, as pci_msix_table_size() did... Drivers can do this, but it's more code. Since pci_enable_msix calls pci_msix_table_size already, let it do the work. Right? If you don't know the vectors number before you enable MSI-X, how can you setup vectors? I don't think it's proper way to go. -- regards Yang, Sheng -- regards Yang, Sheng Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Hi Jesse, This came up when I was adding MSI-X support to virtio pci driver, which does not know the exact table size upfront. Could you consider this patch for 2.6.31 please? drivers/pci/msi.c |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 6f2e629..f5bd1c9 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -687,8 +687,8 @@ int pci_msix_table_size(struct pci_dev *dev) * indicates the successful configuration of MSI-X capability structure * with new allocated MSI-X irqs. A return of 0 indicates a failure. * Or a return of 0 indicates that driver request is exceeding the number - * of irqs available. Driver should use the returned value to re-send - * its request. + * of irqs or MSI-X vectors available. Driver should use the returned value to + * re-send its request. **/ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec) { @@ -704,7 +704,7 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec) nr_entries = pci_msix_table_size(dev); if (nvec nr_entries) - return -EINVAL; + return nr_entries; /* Check for any invalid entries */ for (i = 0; i nvec; i++) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix - do not build blobs when blobs are not needed
Jes Sorensen wrote: Hi, Don't like ugly assembler errors because QEMU tries to build some x86 assembly code which isn't needed on non-x86. .PHONY: kvm/extboot +ifdef INSTALL_BLOBS all: kvm/extboot kvm/extboot: @@ -427,3 +428,4 @@ || ! cmp -s pc-bios/extboot.bin $@/extboot.bin; then \ cp $@/extboot.bin pc-bios/extboot.bin; \ fi +endif Index: qemu-kvm/configure === --- qemu-kvm.orig/configure +++ qemu-kvm/configure @@ -356,6 +356,7 @@ cpu_emulation=no gdbstub=no slirp=no + blobs=no fi if [ $cpu = powerpc ]; then kvm=yes Seems a bit heavy handed. blobs != extboot. How about instead: build-targets-x86 = kvm/extboot all: $(build-targets-$(ARCH)) (or something similar that builds) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thu, May 07, 2009 at 04:51:24PM +0800, Sheng Yang wrote: On Thursday 07 May 2009 16:28:41 Michael S. Tsirkin wrote: pci_enable_msix currently returns -EINVAL if you ask for more vectors than supported by the device, which would typically cause fallback to regular interrupts. It's better to return the table size, making the driver retry MSI-X with less vectors. Hi Michael I think driver should read from capability list to know how many vector supported by this device before enable MSI-X for device, as pci_msix_table_size() did... I think Michael's patch makes sense. It reduces the amount of work the driver has to do without requiring any additional work in the core. I don't see the disadvantage to it. Reviewed-by: Matthew Wilcox wi...@linux.intel.com -- Matthew Wilcox Intel Open Source Technology Centre Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Zhang, Xiantao wrote: Jes Sorensen wrote: Hi, The latest changes to qemu-kvm breaks miserably if one tries to build without CONFIG_GDBSTUB. Jes --- qemu-kvm.orig/vl.c +++ qemu-kvm/vl.c @@ -4417,13 +4417,11 @@ } if (cpu_can_run(env)) ret = qemu_cpu_exec(env); -#ifndef CONFIG_GDBSTUB ^^^ Don't know why change #ifdef to #ifndef in upstream, and I remember it should be ifdef before. I believe this stuff should be compiled only if CONFIG_GDBSTUB is defined. This was introduced by commit 704aec581c1683750e313832ba3aa4813d59cbd0 Author: Xiantao Zhang xiantao.zh...@intel.com Date: Thu Nov 27 17:23:27 2008 +0800 Build fix for !CONFIG_GDBSTUB case Once CONFIG_GDBSTUB not configured, compile will generate error In upstream. Please fix it in upstream and qemu-kvm.git will get the fix from there. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thu, May 07, 2009 at 05:10:46PM +0800, Sheng Yang wrote: I think driver should read from capability list to know how many vector supported by this device before enable MSI-X for device, as pci_msix_table_size() did... Drivers can do this, but it's more code. Since pci_enable_msix calls pci_msix_table_size already, let it do the work. Right? If you don't know the vectors number before you enable MSI-X, how can you setup vectors? I don't know how many irqs I will be able to get anyway. vectors that can't be assigned are useless ... -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix irq routing by moving it to the correct place
Jes Sorensen wrote: Hi, Some freak :-) put x86 specific irq routing into the generic code path, which obviously won't work on non-x86 systems. This patch creates kvm_arch_init_irq_routing() and moves the x86 gibberish to it's correct location :-) Applied, thanks. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix - do not build blobs when blobs are not needed
Avi Kivity wrote: Seems a bit heavy handed. blobs != extboot. How about instead: build-targets-x86 = kvm/extboot all: $(build-targets-$(ARCH)) (or something similar that builds) There's no problem that cannot be fixed by applying an appropriately large hammer :-) This one works for me. Jes Do not try to build extboot on non-x86. Signed-off-by: Jes Sorensen j...@sgi.com --- Makefile |5 - 1 file changed, 4 insertions(+), 1 deletion(-) Index: qemu-kvm/Makefile === --- qemu-kvm.orig/Makefile +++ qemu-kvm/Makefile @@ -419,7 +419,10 @@ .PHONY: kvm/extboot -all: kvm/extboot +build-targets-x86 = kvm/extboot +build-targets-ia64 = + +all: $(build-targets-$(ARCH)) kvm/extboot: $(MAKE) -C $@
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thursday 07 May 2009 17:27:31 Matthew Wilcox wrote: On Thu, May 07, 2009 at 04:51:24PM +0800, Sheng Yang wrote: On Thursday 07 May 2009 16:28:41 Michael S. Tsirkin wrote: pci_enable_msix currently returns -EINVAL if you ask for more vectors than supported by the device, which would typically cause fallback to regular interrupts. It's better to return the table size, making the driver retry MSI-X with less vectors. Hi Michael I think driver should read from capability list to know how many vector supported by this device before enable MSI-X for device, as pci_msix_table_size() did... I think Michael's patch makes sense. It reduces the amount of work the driver has to do without requiring any additional work in the core. I don't see the disadvantage to it. Reviewed-by: Matthew Wilcox wi...@linux.intel.com It's indeed weird. Why the semantic of pci_enable_msix can be changed to enable msix, or tell me how many vector do you have? You can simply call pci_msix_table_size() to get what you want, also without any more work, no? I can't understand... -- regards Yang, Sheng -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Davide Libenzi wrote: On Wed, 6 May 2009, Gregory Haskins wrote: I think we are ok in this regard (at least in v5) without the callback. kvm holds irqfd, which holds eventfd. In a normal situation, we will have eventfd with 2 references. If userspace closes the eventfd, it will drop 1 of the 2 eventfd file references, but the object should remain intact as long as kvm still holds it as well. When the kvm-fd is released, we will then decouple from the eventfd-wqh and drop the last fput(), officially freeing it. Likewise, if kvm is closed before the eventfd, we will simply decouple from the wqh and fput(eventfd), leaving the last reference held by userspace until it closes as well. Let me know if you see any holes in that. Looks OK, modulo my knowledge of KVM internals. What's your take on adding irq context safe callbacks to irqfd? To give some background here, we would like to use eventfd as a generic connector between components, so the components do not know about each other. So far eventfd successfully abstracts among components in the same process, in different processes, and in the kernel. eventfd_signal() can be safely called from irq context, and will wake up a waiting task. But in some cases, if the consumer is in the kernel, it may be able to consume the event from irq context, saving a context switch. So, will you consider patches adding this capability to eventfd? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix - do not build blobs when blobs are not needed
Jes Sorensen wrote: There's no problem that cannot be fixed by applying an appropriately large hammer :-) This one works for me. applied, thanks. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [patch] fix qemu-kvm to build when gdbstub is disabled
Avi Kivity wrote: Zhang, Xiantao wrote: Jes Sorensen wrote: Hi, The latest changes to qemu-kvm breaks miserably if one tries to build without CONFIG_GDBSTUB. Jes --- qemu-kvm.orig/vl.c +++ qemu-kvm/vl.c @@ -4417,13 +4417,11 @@ } if (cpu_can_run(env)) ret = qemu_cpu_exec(env); -#ifndef CONFIG_GDBSTUB ^^^ Don't know why change #ifdef to #ifndef in upstream, and I remember it should be ifdef before. I believe this stuff should be compiled only if CONFIG_GDBSTUB is defined. This was introduced by commit 704aec581c1683750e313832ba3aa4813d59cbd0 Author: Xiantao Zhang xiantao.zh...@intel.com Date: Thu Nov 27 17:23:27 2008 +0800 Build fix for !CONFIG_GDBSTUB case Once CONFIG_GDBSTUB not configured, compile will generate error In upstream. Please fix it in upstream and qemu-kvm.git will get the fix from there. The original patch is 8c4379cc made by me in kvm-userspace.git and I used #ifdef, but accidently changed to #ifndef in qemu-kvm.git. So maybe this typo is introduced by Anthony Liguori when rewrote this patch for qemu-kvm.git ? Xiantao -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thu, May 07, 2009 at 05:40:15PM +0800, Sheng Yang wrote: It's indeed weird. Why the semantic of pci_enable_msix can be changed to enable msix, or tell me how many vector do you have? You can simply call pci_msix_table_size() to get what you want, also without any more work, no? I can't understand... Here's a good example. Let's suppose you have a driver which supports two different models of cards, one has 16 MSI-X interrupts, the other has 10. You can call pci_enable_msix() asking for 16 vectors. If your card is model A, you get 16 interrupts. If your card is model B, it says you can have 10. This is less work in the driver (since it must implement falling back to a smaller number of interrupts *anyway*) than interrogating the card to find out how many interrupts there are, then requesting the right number, and still having the fallback path which is going to be less tested. -- Matthew Wilcox Intel Open Source Technology Centre Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Zhang, Xiantao wrote: Avi Kivity wrote: Zhang, Xiantao wrote: Jes Sorensen wrote: Hi, The latest changes to qemu-kvm breaks miserably if one tries to build without CONFIG_GDBSTUB. Jes --- qemu-kvm.orig/vl.c +++ qemu-kvm/vl.c @@ -4417,13 +4417,11 @@ } if (cpu_can_run(env)) ret = qemu_cpu_exec(env); -#ifndef CONFIG_GDBSTUB ^^^ Don't know why change #ifdef to #ifndef in upstream, and I remember it should be ifdef before. I believe this stuff should be compiled only if CONFIG_GDBSTUB is defined. This was introduced by commit 704aec581c1683750e313832ba3aa4813d59cbd0 Author: Xiantao Zhang xiantao.zh...@intel.com Date: Thu Nov 27 17:23:27 2008 +0800 Build fix for !CONFIG_GDBSTUB case Once CONFIG_GDBSTUB not configured, compile will generate error In upstream. Please fix it in upstream and qemu-kvm.git will get the fix from there. The original patch is 8c4379cc made by me in kvm-userspace.git and I used #ifdef, but accidently changed to #ifndef in qemu-kvm.git. So maybe this typo is introduced by Anthony Liguori when rewrote this patch for qemu-kvm.git ? Xiantao That commit got translated to fe538e6f in qemu-kvm.git. Anthony, what's the reason for the difference between fe538e6f and 704aec58? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] qemu-kvm define cpu_has_work() for ia64
Jes Sorensen wrote: Hi, This one is needed to build on ia64. Applied, thanks. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio-net: Fix save/load
Alex Williamson wrote: Thanks Avi, I think you got the important bits. I might recommend the following change to keep this contained to a version_id 7 load and bail if vnet header support is required, but not available. Thanks, Alex kvm: virtio-net: Cleanup load from vnet header saved image Bail if the saved image requires vnet header support. Thanks, applied. I even thought of this while applying the original patch but I guess my attention span is in a downward spiral so I forgot about it. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thursday 07 May 2009 17:53:02 Matthew Wilcox wrote: On Thu, May 07, 2009 at 05:40:15PM +0800, Sheng Yang wrote: It's indeed weird. Why the semantic of pci_enable_msix can be changed to enable msix, or tell me how many vector do you have? You can simply call pci_msix_table_size() to get what you want, also without any more work, no? I can't understand... Here's a good example. Let's suppose you have a driver which supports two different models of cards, one has 16 MSI-X interrupts, the other has 10. You can call pci_enable_msix() asking for 16 vectors. If your card is model A, you get 16 interrupts. If your card is model B, it says you can have 10. This is less work in the driver (since it must implement falling back to a smaller number of interrupts *anyway*) than interrogating the card to find out how many interrupts there are, then requesting the right number, and still having the fallback path which is going to be less tested. Yeah, partly understand now. But the confusing of return value is not that pleasure compared to this benefit. And even you have to fall back if return 0 anyway, but in the past, you just need fall back once at most; but now you may fall back twice. This make thing more complex - you need either two ifs or a simple loop. And just one if can deal with it before. All that required is one call for pci_msix_table_size(), and I believe most driver would like to know how much vector it have before it fill the vectors, so mostly no extra cost. But for this ambiguous return meaning, you have to add more code for fall back - yes, the driver may can assert that the positive return value always would be irq numbers if it call pci_msix_table_size() before, but is it safe in logic? -- regards Yang, Sheng -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/6] kvm-s390: use hrtimer for clock wakeup from idle
Christian Borntraeger wrote: Why can't this be done from the timer context (after adjusting the locks)? It can be done, in fact I did that in my first version. The thing is, we would need to change all local_int.locks to spinlock_irqs, since standard timers are softirqs and hrtimers and hardirqs. Enabling and disabling irqs is a relatively expensive operating on s390 (but locks via compare and swap are quite cheap). Since we take this specific lock in lots of places this lock is on some hot pathes. The idle wakeup on the other hand is not that critical. Makes sense. Back when irqsave/irqrestore were expensive on x86 (P4 days, I think it was ~100 cycles) there was talk of using a software irq disable flag instead of using the hardware support. So - local_irq_disable() sets a flag - interrupt handlers check the flag, if set they schedule the interrupt handler and return immediately - local_irq_restore() clears the flag and fires and scheduled handlers Now these operations are pretty cheap on x86, but maybe that can apply to s390. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thu, 2009-05-07 at 03:53 -0600, Matthew Wilcox wrote: On Thu, May 07, 2009 at 05:40:15PM +0800, Sheng Yang wrote: It's indeed weird. Why the semantic of pci_enable_msix can be changed to enable msix, or tell me how many vector do you have? You can simply call pci_msix_table_size() to get what you want, also without any more work, no? I can't understand... Here's a good example. Let's suppose you have a driver which supports two different models of cards, one has 16 MSI-X interrupts, the other has 10. You can call pci_enable_msix() asking for 16 vectors. If your card is model A, you get 16 interrupts. If your card is model B, it says you can have 10. This is less work in the driver (since it must implement falling back to a smaller number of interrupts *anyway*) than interrogating the card to find out how many interrupts there are, then requesting the right number, and still having the fallback path which is going to be less tested. Not to mention that there's no guarantee that you'll get as many interrupts as the device supports, so you should really be coding to cope with that anyway. Like the example in MSI-HOWTO.txt: 197 static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec) 198 { 199 while (nvec = FOO_DRIVER_MINIMUM_NVEC) { 200 rc = pci_enable_msix(adapter-pdev, 201 adapter-msix_entries, nvec); 202 if (rc 0) 203 nvec = rc; 204 else 205 return rc; 206 } 207 208 return -ENOSPC; 209 } So I agree, this patch is an improvement. cheers signature.asc Description: This is a digitally signed message part
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thursday 07 May 2009 18:23:50 Michael Ellerman wrote: On Thu, 2009-05-07 at 03:53 -0600, Matthew Wilcox wrote: On Thu, May 07, 2009 at 05:40:15PM +0800, Sheng Yang wrote: It's indeed weird. Why the semantic of pci_enable_msix can be changed to enable msix, or tell me how many vector do you have? You can simply call pci_msix_table_size() to get what you want, also without any more work, no? I can't understand... Here's a good example. Let's suppose you have a driver which supports two different models of cards, one has 16 MSI-X interrupts, the other has 10. You can call pci_enable_msix() asking for 16 vectors. If your card is model A, you get 16 interrupts. If your card is model B, it says you can have 10. This is less work in the driver (since it must implement falling back to a smaller number of interrupts *anyway*) than interrogating the card to find out how many interrupts there are, then requesting the right number, and still having the fallback path which is going to be less tested. Not to mention that there's no guarantee that you'll get as many interrupts as the device supports, so you should really be coding to cope with that anyway. Like the example in MSI-HOWTO.txt: 197 static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec) 198 { 199 while (nvec = FOO_DRIVER_MINIMUM_NVEC) { 200 rc = pci_enable_msix(adapter-pdev, 201 adapter-msix_entries, nvec); 202 if (rc 0) 203 nvec = rc; 204 else 205 return rc; 206 } 207 208 return -ENOSPC; 209 } So I agree, this patch is an improvement. Oh yeah. Forgot irq counts can also be changed from time to time. OK, there should be a loop, so that's fine. :) -- regards Yang, Sheng -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/6] kvm-s390: use hrtimer for clock wakeup from idle
Am Thursday 07 May 2009 12:19:32 schrieb Avi Kivity: Makes sense. Back when irqsave/irqrestore were expensive on x86 (P4 days, I think it was ~100 cycles) there was talk of using a software irq disable flag instead of using the hardware support. So - local_irq_disable() sets a flag - interrupt handlers check the flag, if set they schedule the interrupt handler and return immediately - local_irq_restore() clears the flag and fires and scheduled handlers Now these operations are pretty cheap on x86, but maybe that can apply to s390. Interesting idea. Nevertheless, I dont think it improve our situation. The affected instructions (stosm and stnsm) are more expensive than compare and swap, but its nowhere near the ~100 cycles of P4. Christian -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thu, May 07, 2009 at 06:19:53PM +0800, Sheng Yang wrote: On Thursday 07 May 2009 17:53:02 Matthew Wilcox wrote: On Thu, May 07, 2009 at 05:40:15PM +0800, Sheng Yang wrote: It's indeed weird. Why the semantic of pci_enable_msix can be changed to enable msix, or tell me how many vector do you have? You can simply call pci_msix_table_size() to get what you want, also without any more work, no? I can't understand... Here's a good example. Let's suppose you have a driver which supports two different models of cards, one has 16 MSI-X interrupts, the other has 10. You can call pci_enable_msix() asking for 16 vectors. If your card is model A, you get 16 interrupts. If your card is model B, it says you can have 10. This is less work in the driver (since it must implement falling back to a smaller number of interrupts *anyway*) than interrogating the card to find out how many interrupts there are, then requesting the right number, and still having the fallback path which is going to be less tested. Yeah, partly understand now. But the confusing of return value is not that pleasure compared to this benefit. And even you have to fall back if return 0 anyway, but in the past, you just need fall back once at most; but now you may fall back twice. I don't think that's right - you might not be able to get the number of interrupts that pci_enable_msix reported. This make thing more complex - you need either two ifs or a simple loop. And just one if can deal with it before. All that required is one call for pci_msix_table_size(), and I believe most driver would like to know how much vector it have before it fill the vectors, so mostly no extra cost. But for this ambiguous return meaning, you have to add more code for fall back - yes, the driver may can assert that the positive return value always would be irq numbers if it call pci_msix_table_size() before, but is it safe in logic? If you know how many vectors the card has, then the only failure mode is when you are out of irqs. No change there. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] msi-x: let drivers retry when not enough vectors
Michael Ellerman wrote: Not to mention that there's no guarantee that you'll get as many interrupts as the device supports, so you should really be coding to cope with that anyway. Like the example in MSI-HOWTO.txt: 197 static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec) 198 { 199 while (nvec = FOO_DRIVER_MINIMUM_NVEC) { 200 rc = pci_enable_msix(adapter-pdev, 201 adapter-msix_entries, nvec); 202 if (rc 0) 203 nvec = rc; 204 else 205 return rc; 206 } 207 208 return -ENOSPC; 209 } So I agree, this patch is an improvement. I imagine this loop is present in many drivers. So why not add a helper static int pci_enable_msix_min(struct foo_adapter *adapter, int min_nvec) { int nvec = 2048; while (nvec = min_nvec) { rc = pci_enable_msix(adapter-pdev, adapter-msix_entries, nvec); if (rc == 0) return nvec; else if (rc 0) nvec = rc; else return rc; } return -ENOSPC; } -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Avi Kivity wrote: Zhang, Xiantao wrote: Jes Sorensen wrote: Hi, The latest changes to qemu-kvm breaks miserably if one tries to build without CONFIG_GDBSTUB. Jes --- qemu-kvm.orig/vl.c +++ qemu-kvm/vl.c @@ -4417,13 +4417,11 @@ } if (cpu_can_run(env)) ret = qemu_cpu_exec(env); -#ifndef CONFIG_GDBSTUB ^^^ Don't know why change #ifdef to #ifndef in upstream, and I remember it should be ifdef before. I believe this stuff should be compiled only if CONFIG_GDBSTUB is defined. This was introduced by commit 704aec581c1683750e313832ba3aa4813d59cbd0 Author: Xiantao Zhang xiantao.zh...@intel.com Date: Thu Nov 27 17:23:27 2008 +0800 Build fix for !CONFIG_GDBSTUB case Once CONFIG_GDBSTUB not configured, compile will generate error In upstream. Please fix it in upstream and qemu-kvm.git will get the fix from there. Given that CONFIG_GDBSTUB is always true upstream, I tend to say: Finally drop it upstream and keep this (or a more arch-local) workaround for missing ia64 gdbstub support downstream. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Jan Kiszka wrote: In upstream. Please fix it in upstream and qemu-kvm.git will get the fix from there. Given that CONFIG_GDBSTUB is always true upstream, I tend to say: Finally drop it upstream and keep this (or a more arch-local) workaround for missing ia64 gdbstub support downstream. I agree, unless - we want to make gdbstub support configurable (don't see any overwhelming reason for this, but maybe others do) - we want to merge ia64 kvm support upstream, and don't want to impose gdbstub support (though I'd recommend properly implementing gdbstub) In any case, I'm okay with dropping the check upstream and applying the local fixup. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Avi Kivity wrote: Jan Kiszka wrote: In upstream. Please fix it in upstream and qemu-kvm.git will get the fix from there. Given that CONFIG_GDBSTUB is always true upstream, I tend to say: Finally drop it upstream and keep this (or a more arch-local) workaround for missing ia64 gdbstub support downstream. I agree, unless - we want to make gdbstub support configurable (don't see any overwhelming reason for this, but maybe others do) - we want to merge ia64 kvm support upstream, and don't want to impose gdbstub support (though I'd recommend properly implementing gdbstub) In any case, I'm okay with dropping the check upstream and applying the local fixup. The last ia64 tcg patch I saw one or two days ago was missing gdb support, too (and may suffer from some brokenness for that reason). But I would suggest for both cases (tcg and kvm) to temporarily provide stubs/placeholders instead of keeping the central #ifdefs. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Jan Kiszka wrote: I agree, unless - we want to make gdbstub support configurable (don't see any overwhelming reason for this, but maybe others do) - we want to merge ia64 kvm support upstream, and don't want to impose gdbstub support (though I'd recommend properly implementing gdbstub) In any case, I'm okay with dropping the check upstream and applying the local fixup. The last ia64 tcg patch I saw one or two days ago was missing gdb support, too (and may suffer from some brokenness for that reason). I didn't see ia64/tcg. Did you mean s390? That was host only IIRC. But I would suggest for both cases (tcg and kvm) to temporarily provide stubs/placeholders instead of keeping the central #ifdefs. Agreed. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[ kvm-Bugs-2506814 ] TAP network lockup after some traffic
Bugs item #2506814, was opened at 2009-01-14 11:38 Message generated for change (Comment added) made by mellen You can respond by visiting: https://sourceforge.net/tracker/?func=detailatid=893831aid=2506814group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Fabio Coatti (cova) Assigned to: Nobody/Anonymous (nobody) Summary: TAP network lockup after some traffic Initial Comment: Hi all, we are experiencing severe network troubles using kvm+tap networking. basically, after some network load (we have yet to identify the exact amount of traffic if one exist) network stops working. During lockups, With tcpdump we can see arp requests on guest interface, then on tap, brX and physical interfaces on host system. the arp answers can be seen, with tcpdump, only on physical host interface and bridge (brX), but not on tap device. Basically it seems that packets coming from external network get lost in tap device on the way to guest (kvm). Looking at tap data with ifconfig, the only weird thing is the TX packets overrun count that is 0. every time the network stops working, overrun count increases. This has been observed with several kvm releases (for sure, 76/77/78/79/80/81/82) and with different kernels (tried with some versions among 2.6.25.X, 26.X, 27.X, 28) both on guest and host side. we tried several network drivers (virtio, e1000, rtl) and all shows the same problem. Only 100Mbit drivers seems to be unaffected so far. (only virtio has acceptable performance) (btw: on host machine we have vlan on top of ethX devices) cpu number on guest makes no difference. we tried with vanilla kernel provided kvm modules and with kvm package provided modules. guest: 32 bit host: 64 bit host machine: 2 x Quad-Core AMD Opteron(tm) Processor 2352 16GB, gentoo. Of course I can provide more details or perform other tests and try patches, if someone can give me some hints and advices they will be most welcome. Thanks. -- Comment By: Tais M. Hansen (mellen) Date: 2009-05-07 13:29 Message: Just had another stall. Different host, different guest. Just one guest on that host system. What information would help debug this system? efer_reload0 0 exits 11147836817 5827 fpu_reload 31483547 1 halt_exits8041208101 1397 halt_wakeup 5069136218 0 host_state_reload 13598781750 1507 hypercalls5890581948 4 insn_emulation8114024171 4338 insn_emulation_fail0 0 invlpg 0 0 io_exits 5263494017 2 irq_exits 78104521153 irq_window 0 0 largepages 0 0 mmio_exits 61151 0 mmu_cache_miss 156941726 0 mmu_flooded158265518 0 mmu_pde_zapped 96062900 0 mmu_pte_updated 2302220251 4 mmu_pte_write 1759022770 4 mmu_recycled 0 0 mmu_shadow_zapped 184691156 0 pf_fixed 4270608276 0 pf_guest 1226004351 0 remote_tlb_flush 481003516 2 request_irq0 0 signal_exits 1 0 tlb_flush 453965557229 -- Comment By: Fabio Coatti (cova) Date: 2009-05-07 10:51 Message: We are still biten by this issue. I'm running out of ideas, but if someone can give me some hints on how to track down the problem or at least collect more information I'll try it. Thanks. -- Comment By: Tais M. Hansen (mellen) Date: 2009-05-07 10:39 Message: Just happened again. Seems like it stops generating interrupts on virtio devices: 10: 3001 51199 IO-APIC-fasteoi virtio1, virtio2 11:76948358468134 IO-APIC-fasteoi virtio0 doing an /bin/ls froze the guest for a minute or so until a CTRL-C got through. After /bin/ls: 10: 3001 51220 IO-APIC-fasteoi virtio1, virtio2 11:76948358468134 IO-APIC-fasteoi virtio0 kvm_stat: efer_reload0 0 exits 19150288617 1082 fpu_reload 11976730134 halt_exits 544721221 194 halt_wakeup300187634 141 host_state_reload 837279034 259 hypercalls4991133618 0 insn_emulation4797940911 709 insn_emulation_fail0 0 invlpg 0 0
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Avi Kivity wrote: Jan Kiszka wrote: I agree, unless - we want to make gdbstub support configurable (don't see any overwhelming reason for this, but maybe others do) - we want to merge ia64 kvm support upstream, and don't want to impose gdbstub support (though I'd recommend properly implementing gdbstub) In any case, I'm okay with dropping the check upstream and applying the local fixup. The last ia64 tcg patch I saw one or two days ago was missing gdb support, too (and may suffer from some brokenness for that reason). I didn't see ia64/tcg. Did you mean s390? That was host only IIRC. Ah, you are right. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix irq routing by moving it to the correct place
Avi Kivity wrote: Jes Sorensen wrote: Hi, Some freak :-) put x86 specific irq routing into the generic code path, which obviously won't work on non-x86 systems. This patch creates kvm_arch_init_irq_routing() and moves the x86 gibberish to it's correct location :-) Applied, thanks. Argh, that didn't build on x86. I fixed it up. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thu, May 07, 2009 at 01:44:38PM +0300, Avi Kivity wrote: I imagine this loop is present in many drivers. So why not add a helper Let's look! arch/x86/kernel/amd_iommu_init.c : Needs an exact number of vectors. drivers/block/cciss.c : If it doesn't get all 4 vectors, falls back to pin mode (instead of MSI mode -- bug!) drivers/dma/ioat_dma.c : Falls back down to 1 vector if it can't get nvec drivers/infiniband/hw/mthca/mthca_main.c : Reverts to MSI if it can't get 3. drivers/net/benet/be_main.c : Falls back to MSI if it can't get 2. drivers/net/bnx2.c : Falls back to MSI if it can't get 9. drivers/net/bnx2x_main.c : Falls back to MSI if it can't get N. drivers/net/e1000e/netdev.c: Falls back to MSI if it can't get N. drivers/net/enic/enic_main.c: Falls back to MSI if it can't get N. drivers/net/forcedeth.c: Falls back to MSI if it can't get N. drivers/net/igb/igb_main.c: Falls back to MSI if it can't get N. drivers/net/igbvf/netdev.c: Falls back to MSI if it can't get 3. drivers/net/myri10ge/myri10ge.c: Falls back to Pin if it can't get N. drivers/net/netxen/netxen_nic_main.c: Falls back to MSI if it can't get N. drivers/net/qlge/qlge_main.c: Falls back to MSI if it can't get N. drivers/net/s2io.c: Falls back to MSI if it can't get N. drivers/net/vxge/vxge-main.c: Falls back once to a second number. drivers/pci/pcie/portdrv_core.c: Falls back to MSI if it can't get all of them. drivers/scsi/lpfc/lpfc_init.c: Falls back to MSI if it can't get N. drivers/scsi/mpt2sas/mpt2sas_base.c: Only allocates 1. drivers/net/cxgb3/cxgb3_main.c: Actually falls back ... in a bit of a weird way. This one could definitely do with the proposed loop. drivers/net/ixgbe/ixgbe_main.c: Could also be improved with this loop. drivers/net/mlx4/main.c: Nasty goto-based loop. drivers/net/niu.c: Ditto drivers/net/sfc/efx.c: Only falls back once. Would benefit from loop. drivers/scsi/qla2xxx/qla_isr.c: Only falls back once. Would benefit from loop. drivers/staging/sxg/sxg.c: /*Should try with less vector returned.*/ so ... 7 drivers would benefit from this loop. 20 seem like they wouldn't. What a lot of drivers seem to do is fall back either to a single MSI or just pin-based interrupts when they can't get as many interrupts as they would like. They don't try to get a single MSI-X interrupt. I feel an update to the MSI-HOWTO coming on. -- Matthew Wilcox Intel Open Source Technology Centre Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] kvm: bios: Use a different mask to size the option ROM BAR
Alex Williamson wrote: Bit 0 is the enable bit, which we not only don't want to set, but it will stick and make us think it's an I/O port resource. Signed-off-by: Alex Williamson alex.william...@hp.com --- kvm/bios/rombios32.c |8 +--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c index 8684987..6502e63 100755 --- a/kvm/bios/rombios32.c +++ b/kvm/bios/rombios32.c @@ -958,11 +958,13 @@ static void pci_bios_init_device(PCIDevice *d) int ofs; uint32_t val, size ; -if (i == PCI_ROM_SLOT) +if (i == PCI_ROM_SLOT) { ofs = 0x30; -else +pci_config_writel(d, ofs, 0xfffe); +} else { ofs = 0x10 + i * 4; -pci_config_writel(d, ofs, 0x); +pci_config_writel(d, ofs, 0x); +} val = pci_config_readl(d, ofs); if (val != 0) { size = (~(val ~0xf)) + 1; Looks good. I think it applies upstream (or even upstream's upstream) though, in which case please send it qemu-devel. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: Fix assigned device with no irq
Sheng Yang wrote: Some device like VF of SRIOV only support MSI-X. With this patch, SRIOV can be enabled with KVM assigned device(all kernel space patches are ready). Just noticed I missed this patch. Is it still needed or is it obsolete? diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4d2be16..ce80f3a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm, kvm_free_irq_source_id(kvm, assigned_dev-irq_source_id); assigned_dev-irq_source_id = -1; - if (!assigned_dev-irq_requested_type) + if (!assigned_dev-irq_requested_type) { + if (assigned_dev-dev-msi_enabled) + pci_disable_msi(assigned_dev-dev); return; + } /* * In kvm_free_device_irq, cancel_work_sync return true if: @@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm *kvm, if (adev-irq_requested_type KVM_ASSIGNED_DEV_HOST_INTX) return 0; + /* IRQ 0 means uninitialized here */ + if (airq-host_irq == 0) + return 0; + if (irqchip_in_kernel(kvm)) { if (!msi2intx (adev-irq_requested_type KVM_ASSIGNED_DEV_HOST_MSI)) { @@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm, r = 0; } else { printk(KERN_WARNING - kvm: failed to enable MSI device!\n); - r = -ENOTTY; - goto out_release; + kvm: device didn't support INTx or MSI!\n); + r = 0; } } else { /* Non-sharing INTx mode */ -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: Fix assigned device with no irq
On Thu, May 07, 2009 at 03:24:15PM +0300, Avi Kivity wrote: Sheng Yang wrote: Some device like VF of SRIOV only support MSI-X. With this patch, SRIOV can be enabled with KVM assigned device(all kernel space patches are ready). Just noticed I missed this patch. Is it still needed or is it obsolete? I think its obsolete. diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4d2be16..ce80f3a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm, kvm_free_irq_source_id(kvm, assigned_dev-irq_source_id); assigned_dev-irq_source_id = -1; - if (!assigned_dev-irq_requested_type) +if (!assigned_dev-irq_requested_type) { +if (assigned_dev-dev-msi_enabled) +pci_disable_msi(assigned_dev-dev); return; +} /* * In kvm_free_device_irq, cancel_work_sync return true if: @@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm *kvm, if (adev-irq_requested_type KVM_ASSIGNED_DEV_HOST_INTX) return 0; + /* IRQ 0 means uninitialized here */ +if (airq-host_irq == 0) +return 0; + if (irqchip_in_kernel(kvm)) { if (!msi2intx (adev-irq_requested_type KVM_ASSIGNED_DEV_HOST_MSI)) { @@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm, r = 0; } else { printk(KERN_WARNING - kvm: failed to enable MSI device!\n); -r = -ENOTTY; -goto out_release; + kvm: device didn't support INTx or MSI!\n); +r = 0; } } else { /* Non-sharing INTx mode */ -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: Fix assigned device with no irq
On Thursday 07 May 2009 20:28:12 Marcelo Tosatti wrote: On Thu, May 07, 2009 at 03:24:15PM +0300, Avi Kivity wrote: Sheng Yang wrote: Some device like VF of SRIOV only support MSI-X. With this patch, SRIOV can be enabled with KVM assigned device(all kernel space patches are ready). Just noticed I missed this patch. Is it still needed or is it obsolete? I think its obsolete. Yeah, it have been fixed from another direct in userspace: Interrupt pin =0 means it haven't been initialized. -- regards Yang, Sheng diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4d2be16..ce80f3a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm, kvm_free_irq_source_id(kvm, assigned_dev-irq_source_id); assigned_dev-irq_source_id = -1; - if (!assigned_dev-irq_requested_type) + if (!assigned_dev-irq_requested_type) { + if (assigned_dev-dev-msi_enabled) + pci_disable_msi(assigned_dev-dev); return; + } /* * In kvm_free_device_irq, cancel_work_sync return true if: @@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm *kvm, if (adev-irq_requested_type KVM_ASSIGNED_DEV_HOST_INTX) return 0; + /* IRQ 0 means uninitialized here */ + if (airq-host_irq == 0) + return 0; + if (irqchip_in_kernel(kvm)) { if (!msi2intx (adev-irq_requested_type KVM_ASSIGNED_DEV_HOST_MSI)) { @@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm, r = 0; } else { printk(KERN_WARNING - kvm: failed to enable MSI device!\n); - r = -ENOTTY; - goto out_release; + kvm: device didn't support INTx or MSI!\n); + r = 0; } } else { /* Non-sharing INTx mode */ -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Avi Kivity wrote: I agree, unless - we want to make gdbstub support configurable (don't see any overwhelming reason for this, but maybe others do) - we want to merge ia64 kvm support upstream, and don't want to impose gdbstub support (though I'd recommend properly implementing gdbstub) In any case, I'm okay with dropping the check upstream and applying the local fixup. Hi, Here's a patch that fixes the #ifndef to make it the #ifdef as it was intended. I am quite fine with us trying to drop all the #ifdefs and introduce noop wrappers for archs that do not provide the gdbstubs (ie. ia64), but to start with we better just fix the #ifndef to make it behave like it was originally intended. Cheers, Jes Fix incorrect #ifndef for CONFIG_GETSTUB, which should have been an #ifdef. Signed-off-by: Jes Sorensen j...@sgi.com --- vl.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: qemu/vl.c === --- qemu.orig/vl.c +++ qemu/vl.c @@ -4350,7 +4350,7 @@ } if (cpu_can_run(env)) ret = qemu_cpu_exec(env); -#ifndef CONFIG_GDBSTUB +#ifdef CONFIG_GDBSTUB if (ret == EXCP_DEBUG) { gdb_set_stop_cpu(env); debug_requested = 1;
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Jes Sorensen wrote: Avi Kivity wrote: I agree, unless - we want to make gdbstub support configurable (don't see any overwhelming reason for this, but maybe others do) - we want to merge ia64 kvm support upstream, and don't want to impose gdbstub support (though I'd recommend properly implementing gdbstub) In any case, I'm okay with dropping the check upstream and applying the local fixup. Hi, Here's a patch that fixes the #ifndef to make it the #ifdef as it was intended. I am quite fine with us trying to drop all the #ifdefs and introduce noop wrappers for archs that do not provide the gdbstubs (ie. ia64), but to start with we better just fix the #ifndef to make it behave like it was originally intended. As the original change this patch (correctly) fixes broke gdbstub support in qemu-kvm for all archs + it is still a deviation from upstream, please take the chance and go for ia64 stubs. BTW, I'm planning to submit a cleanup patch for CONFIG_GDBSTUB to upstream soon. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Jan Kiszka wrote: As the original change this patch (correctly) fixes broke gdbstub support in qemu-kvm for all archs + it is still a deviation from upstream, please take the chance and go for ia64 stubs. Jan, There's far more to do in upstream than just gdbstubs for ia64 to be usable :-( We'll get to gdbstubs at some point, but there's too many other higher priority items to deal with first. Cheers, Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Jes Sorensen wrote: Jan Kiszka wrote: As the original change this patch (correctly) fixes broke gdbstub support in qemu-kvm for all archs + it is still a deviation from upstream, please take the chance and go for ia64 stubs. Jan, There's far more to do in upstream than just gdbstubs for ia64 to be usable :-( We'll get to gdbstubs at some point, but there's too many other higher priority items to deal with first. I'm not asking for full gdbstub support at this point, just for a build workaround at arch-level (or in arch-specific blocks, in gdbstub.c e.g.) instead of patching generic code with global risk of regressions. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/4] use smp_send_reschedule in vcpu_kick / assigned dev host intx race fix
Sheng Yang wrote: Is there any issue blocked this patchset? Yes, a slow maintainer. I'll go review it now. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/4] qemu: external module: smp_send_reschedule compat
mtosa...@redhat.com wrote: smp_send_reschedule was exported (via smp_ops) in v2.6.24. Create a compat function which schedules the IPI to keventd context, in case interrupts are disabled, for kernels 2.6.24. #endif +#if LINUX_VERSION_CODE KERNEL_VERSION(2,6,28) +#define nr_cpu_ids NR_CPUS +#endif + Doesn't seem related? Please send as a separate patch. diff --git a/kvm/kernel/x86/hack-module.awk b/kvm/kernel/x86/hack-module.awk index 260eeef..f4d14da 100644 --- a/kvm/kernel/x86/hack-module.awk +++ b/kvm/kernel/x86/hack-module.awk @@ -35,6 +35,8 @@ BEGIN { split(INIT_WORK desc_struct ldttss_desc64 desc_ptr \ { sub(/match-dev-msi_enabled/, kvm_pcidev_msi_enabled(match-dev)) } +{ sub(/smp_send_reschedule/, kvm_smp_send_reschedule) } + /^static void __vmx_load_host_state/ { vmx_load_host_state = 1 } There's a bit on the top of hack-module.awk that does this slightly simpler. Please also adjust ia64. This now resides in kvm-kmod.git, please patch against that. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Jan Kiszka wrote: I'm not asking for full gdbstub support at this point, just for a build workaround at arch-level (or in arch-specific blocks, in gdbstub.c e.g.) instead of patching generic code with global risk of regressions. Oh, I see. I already submitted that to Avi, making gdb_set_stop_cpu() a noop on ia64. Since there's no target-ia64/ directory in mainline qemu.git yet, that fix will have to live in qemu-kvm for the time being. I am fine if you wish to remove all the #ifdef CONFIG_GDBSTUB bits. Cheers, Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
On Thu, May 07, 2009 at 12:48:21PM +0300, Avi Kivity wrote: Davide Libenzi wrote: On Wed, 6 May 2009, Gregory Haskins wrote: I think we are ok in this regard (at least in v5) without the callback. kvm holds irqfd, which holds eventfd. In a normal situation, we will have eventfd with 2 references. If userspace closes the eventfd, it will drop 1 of the 2 eventfd file references, but the object should remain intact as long as kvm still holds it as well. When the kvm-fd is released, we will then decouple from the eventfd-wqh and drop the last fput(), officially freeing it. Likewise, if kvm is closed before the eventfd, we will simply decouple from the wqh and fput(eventfd), leaving the last reference held by userspace until it closes as well. Let me know if you see any holes in that. Looks OK, modulo my knowledge of KVM internals. What's your take on adding irq context safe callbacks to irqfd? To give some background here, we would like to use eventfd as a generic connector between components, so the components do not know about each other. So far eventfd successfully abstracts among components in the same process, in different processes, and in the kernel. eventfd_signal() can be safely called from irq context, and will wake up a waiting task. But in some cases, if the consumer is in the kernel, it may be able to consume the event from irq context, saving a context switch. So, will you consider patches adding this capability to eventfd? (pasting from a separate thread) That's my thinking. PCI interrupts don't work because we need to do some hacky stuff in there, but MSI should. Oh, and we could improve UIO support for interrupts when using MSI, since there's no need to acknowledge the interrupt. Ok, so for INTx assigned devices all you need to do on the ACK handler is to re-enable the host interrupt (and set the guest interrupt line to low). Right now the ack comes through a kvm internal irq ack callback. AFAICS there is no mechanism in irqfd for ACK notification, and interrupt injection is edge triggered. So for PCI INTx assigned devices (or any INTx level), you'd want to keep the guest interrupt high, with some way to notify the ACK. Avi mentioned a separate irqfd to notify the ACK. For assigned devices, you could register a fd wakeup function in that fd, which replaces the current irq ACK callback? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Marcelo Tosatti wrote: On Thu, May 07, 2009 at 12:48:21PM +0300, Avi Kivity wrote: Davide Libenzi wrote: On Wed, 6 May 2009, Gregory Haskins wrote: I think we are ok in this regard (at least in v5) without the callback. kvm holds irqfd, which holds eventfd. In a normal situation, we will have eventfd with 2 references. If userspace closes the eventfd, it will drop 1 of the 2 eventfd file references, but the object should remain intact as long as kvm still holds it as well. When the kvm-fd is released, we will then decouple from the eventfd-wqh and drop the last fput(), officially freeing it. Likewise, if kvm is closed before the eventfd, we will simply decouple from the wqh and fput(eventfd), leaving the last reference held by userspace until it closes as well. Let me know if you see any holes in that. Looks OK, modulo my knowledge of KVM internals. What's your take on adding irq context safe callbacks to irqfd? To give some background here, we would like to use eventfd as a generic connector between components, so the components do not know about each other. So far eventfd successfully abstracts among components in the same process, in different processes, and in the kernel. eventfd_signal() can be safely called from irq context, and will wake up a waiting task. But in some cases, if the consumer is in the kernel, it may be able to consume the event from irq context, saving a context switch. So, will you consider patches adding this capability to eventfd? (pasting from a separate thread) That's my thinking. PCI interrupts don't work because we need to do some hacky stuff in there, but MSI should. Oh, and we could improve UIO support for interrupts when using MSI, since there's no need to acknowledge the interrupt. Ok, so for INTx assigned devices all you need to do on the ACK handler is to re-enable the host interrupt (and set the guest interrupt line to low). Right now the ack comes through a kvm internal irq ack callback. AFAICS there is no mechanism in irqfd for ACK notification, and interrupt injection is edge triggered. So for PCI INTx assigned devices (or any INTx level), you'd want to keep the guest interrupt high, with some way to notify the ACK. Avi mentioned a separate irqfd to notify the ACK. For assigned devices, you could register a fd wakeup function in that fd, which replaces the current irq ACK callback? One thing I was thinking here was that I could create a flag for the kvm_irqfd() function for something like KVM_IRQFD_MODE_CLEAR. This flag when specified at creation time will cause the event to execute a clear operation instead of a set when triggered. That way, the default mode is an edge-triggered set. The non-default mode is to trigger a clear. Level-triggered ints could therefore create two irqfds, one for raising, the other for clearing. An alternative is to abandon the use of eventfd, and allow the irqfd to be a first-class anon-fd. The parameters passed to the write/signal() function could then indicate the desired level. The disadvantage would be that it would not be compatible with eventfd, so we would need to decide if the tradeoff is worth it. OTOH, I suspect level triggered interrupts will be primarily in the legacy domain, so perhaps we do not need to worry about it too much. Therefore, another option is that we *could* simply set the stake in the ground that legacy/level cannot use irqfd. Thoughts? -Greg signature.asc Description: OpenPGP digital signature
[PATCHv2 0/3] virtio: add guest MSI-X support
Add optional MSI-X support: use a vector per virtqueue with fallback to a common vector and finally to regular interrupt. Teach all drivers to use it. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Here's a draft set of patches for MSI-X support in the guest. It still needs to be tested properly, and performance impact measured, but I thought I'd share it here in the hope of getting some very early feedback/flames. Changelog since v1: - Per Avi's suggestion, let guest configure virtqueue to vector mapping - Per Rusty's suggestion, replace API with find_vqs/del_vqs. Michael S. Tsirkin (3): virtio: find_vqs/del_vqs virtio operations virtio_pci: split up vp_interrupt virtio_pci: optional MSI-X support drivers/block/virtio_blk.c | 11 +- drivers/char/hw_random/virtio-rng.c | 11 +- drivers/char/virtio_console.c | 27 ++-- drivers/lguest/lguest_device.c | 49 ++- drivers/net/virtio_net.c| 47 +++ drivers/s390/kvm/kvm_virtio.c | 64 - drivers/virtio/virtio_balloon.c | 31 ++-- drivers/virtio/virtio_pci.c | 283 ++- include/linux/virtio_config.h | 29 +++- include/linux/virtio_pci.h |8 +- 10 files changed, 440 insertions(+), 120 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] virtio: find_vqs/del_vqs virtio operations
This replaces find_vq/del_vq with find_vqs/del_vqs virtio operations, and updates all drivers. This is needed for MSI support, because MSI needs to know the total number of vectors upfront. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/block/virtio_blk.c | 11 +++--- drivers/char/hw_random/virtio-rng.c | 11 +++--- drivers/char/virtio_console.c | 27 ++ drivers/lguest/lguest_device.c | 49 +- drivers/net/virtio_net.c| 47 +++--- drivers/s390/kvm/kvm_virtio.c | 64 +- drivers/virtio/virtio_balloon.c | 31 - drivers/virtio/virtio_pci.c | 43 +++ include/linux/virtio_config.h | 29 +++- 9 files changed, 222 insertions(+), 90 deletions(-) diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 5d34764..7b7435d 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -197,6 +197,7 @@ static int virtblk_probe(struct virtio_device *vdev) u64 cap; u32 v; u32 blk_size, sg_elems; + virtqueue_callback *callback[] = { blk_done }; if (index_to_minor(index) = 1 MINORBITS) return -ENOSPC; @@ -224,11 +225,9 @@ static int virtblk_probe(struct virtio_device *vdev) sg_init_table(vblk-sg, vblk-sg_elems); /* We expect one virtqueue, for output. */ - vblk-vq = vdev-config-find_vq(vdev, 0, blk_done); - if (IS_ERR(vblk-vq)) { - err = PTR_ERR(vblk-vq); + err = vdev-config-find_vqs(vdev, 1, vblk-vq, callback); + if (err) goto out_free_vblk; - } vblk-pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req)); if (!vblk-pool) { @@ -323,7 +322,7 @@ out_put_disk: out_mempool: mempool_destroy(vblk-pool); out_free_vq: - vdev-config-del_vq(vblk-vq); + vdev-config-del_vqs(vdev); out_free_vblk: kfree(vblk); out: @@ -344,7 +343,7 @@ static void virtblk_remove(struct virtio_device *vdev) blk_cleanup_queue(vblk-disk-queue); put_disk(vblk-disk); mempool_destroy(vblk-pool); - vdev-config-del_vq(vblk-vq); + vdev-config-del_vqs(vdev); kfree(vblk); } diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c index 86e83f8..18eabe4 100644 --- a/drivers/char/hw_random/virtio-rng.c +++ b/drivers/char/hw_random/virtio-rng.c @@ -91,16 +91,17 @@ static struct hwrng virtio_hwrng = { static int virtrng_probe(struct virtio_device *vdev) { + virtqueue_callback *callback[] = { random_recv_done }; int err; /* We expect a single virtqueue. */ - vq = vdev-config-find_vq(vdev, 0, random_recv_done); - if (IS_ERR(vq)) - return PTR_ERR(vq); + err = vdev-config-find_vqs(vdev, 1, vq, callback); + if (err) + return err; err = hwrng_register(virtio_hwrng); if (err) { - vdev-config-del_vq(vq); + vdev-config-del_vqs(vdev); return err; } @@ -112,7 +113,7 @@ static void virtrng_remove(struct virtio_device *vdev) { vdev-config-reset(vdev); hwrng_unregister(virtio_hwrng); - vdev-config-del_vq(vq); + vdev-config-del_vqs(vdev); } static struct virtio_device_id id_table[] = { diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index ff6f5a4..1fd5376 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -188,6 +188,8 @@ static void hvc_handle_input(struct virtqueue *vq) * Finally we put our input buffer in the input queue, ready to receive. */ static int __devinit virtcons_probe(struct virtio_device *dev) { + struct virtqueue *vqs[2]; + virtqueue_callback *callbacks[2]; int err; vdev = dev; @@ -199,20 +201,17 @@ static int __devinit virtcons_probe(struct virtio_device *dev) goto fail; } - /* Find the input queue. */ + /* Find the queues. */ /* FIXME: This is why we want to wean off hvc: we do nothing * when input comes in. */ - in_vq = vdev-config-find_vq(vdev, 0, hvc_handle_input); - if (IS_ERR(in_vq)) { - err = PTR_ERR(in_vq); + callbacks[0] = hvc_handle_input; + callbacks[1] = NULL; + err = vdev-config-find_vqs(vdev, 2, vqs, callbacks); + if (err) goto free; - } - out_vq = vdev-config-find_vq(vdev, 1, NULL); - if (IS_ERR(out_vq)) { - err = PTR_ERR(out_vq); - goto free_in_vq; - } + in_vq = vqs[0]; + out_vq = vqs[1]; /* Start using the new console output. */ virtio_cons.get_chars = get_chars; @@ -233,17 +232,15 @@ static int __devinit virtcons_probe(struct virtio_device *dev)
[PATCH 2/3] virtio_pci: split up vp_interrupt
This reorganizes virtio-pci code in vp_interrupt slightly, so that it's easier to add per-vq MSI support on top. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/virtio/virtio_pci.c | 45 +- 1 files changed, 35 insertions(+), 10 deletions(-) diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index 3671c42..f7b79a2 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -164,6 +164,37 @@ static void vp_notify(struct virtqueue *vq) iowrite16(info-queue_index, vp_dev-ioaddr + VIRTIO_PCI_QUEUE_NOTIFY); } +/* Handle a configuration change: Tell driver if it wants to know. */ +static irqreturn_t vp_config_changed(int irq, void *opaque) +{ + struct virtio_pci_device *vp_dev = opaque; + struct virtio_driver *drv; + drv = container_of(vp_dev-vdev.dev.driver, + struct virtio_driver, driver); + + if (drv drv-config_changed) + drv-config_changed(vp_dev-vdev); + return IRQ_HANDLED; +} + +/* Notify all virtqueues on an interrupt. */ +static irqreturn_t vp_vring_interrupt(int irq, void *opaque) +{ + struct virtio_pci_device *vp_dev = opaque; + struct virtio_pci_vq_info *info; + irqreturn_t ret = IRQ_NONE; + unsigned long flags; + + spin_lock_irqsave(vp_dev-lock, flags); + list_for_each_entry(info, vp_dev-virtqueues, node) { + if (vring_interrupt(irq, info-vq) == IRQ_HANDLED) + ret = IRQ_HANDLED; + } + spin_unlock_irqrestore(vp_dev-lock, flags); + + return ret; +} + /* A small wrapper to also acknowledge the interrupt when it's handled. * I really need an EIO hook for the vring so I can ack the interrupt once we * know that we'll be handling the IRQ but before we invoke the callback since @@ -173,9 +204,6 @@ static void vp_notify(struct virtqueue *vq) static irqreturn_t vp_interrupt(int irq, void *opaque) { struct virtio_pci_device *vp_dev = opaque; - struct virtio_pci_vq_info *info; - irqreturn_t ret = IRQ_NONE; - unsigned long flags; u8 isr; /* reading the ISR has the effect of also clearing it so it's very @@ -187,14 +215,11 @@ static irqreturn_t vp_interrupt(int irq, void *opaque) return IRQ_NONE; /* Configuration change? Tell driver if it wants to know. */ - if (isr VIRTIO_PCI_ISR_CONFIG) { - struct virtio_driver *drv; - drv = container_of(vp_dev-vdev.dev.driver, - struct virtio_driver, driver); + if (isr VIRTIO_PCI_ISR_CONFIG) + vp_config_changed(irq, opaque); - if (drv drv-config_changed) - drv-config_changed(vp_dev-vdev); - } + return vp_vring_interrupt(irq, opaque); +} spin_lock_irqsave(vp_dev-lock, flags); list_for_each_entry(info, vp_dev-virtqueues, node) { -- 1.6.3.rc3.1.g830204 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] virtio_pci: optional MSI-X support
This implements optional MSI-X support in virtio_pci. MSI-X is used whenever the host supports at least 2 MSI-X vectors: 1 for configuration changes and 1 for virtqueues. Per-virtqueue vectors are allocated if enough vectors available. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/virtio/virtio_pci.c | 209 +-- include/linux/virtio_pci.h |8 ++- 2 files changed, 190 insertions(+), 27 deletions(-) diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h index cd0fd5d..4a0275b 100644 --- a/include/linux/virtio_pci.h +++ b/include/linux/virtio_pci.h @@ -47,9 +47,15 @@ /* The bit of the ISR which indicates a device configuration change. */ #define VIRTIO_PCI_ISR_CONFIG 0x2 +/* MSI-X registers: only enabled if MSI-X is enabled. */ +/* A 16-bit vector for configuration changes. */ +#define VIRTIO_MSI_CONFIG_VECTOR20 +/* A 16-bit vector for selected queue notifications. */ +#define VIRTIO_MSI_QUEUE_VECTOR 22 + /* The remaining space is defined by each driver as the per-driver * configuration space */ -#define VIRTIO_PCI_CONFIG 20 +#define VIRTIO_PCI_CONFIG(dev) ((dev)-msix_enabled ? 24 : 20) /* Virtio ABI version, this must match exactly */ #define VIRTIO_PCI_ABI_VERSION 0 diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index f7b79a2..2b6333c 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -42,6 +42,28 @@ struct virtio_pci_device /* a list of queues so we can dispatch IRQs */ spinlock_t lock; struct list_head virtqueues; + + /* MSI-X support */ + int msix_enabled; + struct msix_entry *msix_entries; + /* Name strings for interrupts. This size should be enough, +* and I'm too lazy to allocate each name separately. */ + char (*msix_names)[256]; + /* Number of vectors configured at startup (excludes per-virtqueue +* vectors if any) */ + unsigned msix_preset_vectors; + /* Number of per-virtqueue vectors if any. */ + unsigned msix_per_vq_vectors; +}; + +/* Constants for MSI-X */ +/* Use first vector for configuration changes, second and the rest for + * virtqueues Thus, we need at least 2 vectors for MSI. */ +enum { + VP_MSIX_CONFIG_VECTOR = 0, + VP_MSIX_VQ_VECTOR = 1, + VP_MSIX_MIN_VECTORS = 2, + VP_MSIX_NO_VECTOR = 0x, }; struct virtio_pci_vq_info @@ -60,6 +82,9 @@ struct virtio_pci_vq_info /* the list node for the virtqueues list */ struct list_head node; + + /* MSI-X vector (or none) */ + unsigned vector; }; /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */ @@ -109,7 +134,8 @@ static void vp_get(struct virtio_device *vdev, unsigned offset, void *buf, unsigned len) { struct virtio_pci_device *vp_dev = to_vp_device(vdev); - void __iomem *ioaddr = vp_dev-ioaddr + VIRTIO_PCI_CONFIG + offset; + void __iomem *ioaddr = vp_dev-ioaddr + + VIRTIO_PCI_CONFIG(vp_dev) + offset; u8 *ptr = buf; int i; @@ -123,7 +149,8 @@ static void vp_set(struct virtio_device *vdev, unsigned offset, const void *buf, unsigned len) { struct virtio_pci_device *vp_dev = to_vp_device(vdev); - void __iomem *ioaddr = vp_dev-ioaddr + VIRTIO_PCI_CONFIG + offset; + void __iomem *ioaddr = vp_dev-ioaddr + + VIRTIO_PCI_CONFIG(vp_dev) + offset; const u8 *ptr = buf; int i; @@ -221,17 +248,110 @@ static irqreturn_t vp_interrupt(int irq, void *opaque) return vp_vring_interrupt(irq, opaque); } - spin_lock_irqsave(vp_dev-lock, flags); - list_for_each_entry(info, vp_dev-virtqueues, node) { - if (vring_interrupt(irq, info-vq) == IRQ_HANDLED) - ret = IRQ_HANDLED; +/* the config-free_vqs() implementation */ +static void vp_free_vectors(struct virtio_device *vdev) { + struct virtio_pci_device *vp_dev = to_vp_device(vdev); + int i; + + /* Disable the vector used for configuration */ + iowrite16(VP_MSIX_NO_VECTOR, + vp_dev-ioaddr + VIRTIO_MSI_CONFIG_VECTOR); + + for (i = 0; i vp_dev-msix_preset_vectors; ++i) + free_irq(vp_dev-msix_entries[i].vector, vp_dev); + + if (!vp_dev-msix_preset_vectors) + free_irq(vp_dev-pci_dev-irq, vp_dev); + + if (vp_dev-msix_enabled) { + vp_dev-msix_enabled = 0; + pci_disable_msix(vp_dev-pci_dev); } - spin_unlock_irqrestore(vp_dev-lock, flags); +} - return ret; +static int vp_request_vectors(struct virtio_device *vdev, unsigned max_vqs) +{ + struct virtio_pci_device *vp_dev = to_vp_device(vdev); + const char *name = dev_name(vp_dev-vdev.dev); + unsigned i, vectors; + int err =
Re: [patch 3/4] KVM: introduce kvm_arch_can_free_memslot, disallow slot deletion if cached cr3
mtosa...@redhat.com wrote: Disallow the deletion of memory slots (and aliases, for x86 case), if a vcpu contains a cr3 that points to such slot/alias. That allows the guest to induce failures in the host. Better to triple-fault the guest instead. +int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) +{ + return 1; +} + In general, instead of stubs in every arch, have x86 say KVM_HAVE_ARCH_CAN_FREE_MEMSLOT and define the stub in generic code when that define is not present. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Gregory Haskins wrote: One thing I was thinking here was that I could create a flag for the kvm_irqfd() function for something like KVM_IRQFD_MODE_CLEAR. This flag when specified at creation time will cause the event to execute a clear operation instead of a set when triggered. That way, the default mode is an edge-triggered set. The non-default mode is to trigger a clear. Level-triggered ints could therefore create two irqfds, one for raising, the other for clearing. That's my second choice option. An alternative is to abandon the use of eventfd, and allow the irqfd to be a first-class anon-fd. The parameters passed to the write/signal() function could then indicate the desired level. The disadvantage would be that it would not be compatible with eventfd, so we would need to decide if the tradeoff is worth it. I would really like to keep using eventfd. Which is why I asked Davide about the prospects of direct callbacks (vs wakeups). OTOH, I suspect level triggered interrupts will be primarily in the legacy domain, so perhaps we do not need to worry about it too much. Therefore, another option is that we *could* simply set the stake in the ground that legacy/level cannot use irqfd. This is my preferred option. For a virtio-net-server in the kernel, we'd service its eventfd in qemu, raising and lowering the pci interrupt in the traditional way. But we'd still need to know when to lower the interrupt. How? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
On Thu, 7 May 2009, Avi Kivity wrote: What's your take on adding irq context safe callbacks to irqfd? To give some background here, we would like to use eventfd as a generic connector between components, so the components do not know about each other. So far eventfd successfully abstracts among components in the same process, in different processes, and in the kernel. eventfd_signal() can be safely called from irq context, and will wake up a waiting task. But in some cases, if the consumer is in the kernel, it may be able to consume the event from irq context, saving a context switch. So, will you consider patches adding this capability to eventfd? Maybe I got lost in the thread, but inside the kernel we have callback-based wakeup since long time. This is what epoll uses, when hooking into the file* f_op-poll() subsystem. Did you mean something else? - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Avi Kivity wrote: Gregory Haskins wrote: One thing I was thinking here was that I could create a flag for the kvm_irqfd() function for something like KVM_IRQFD_MODE_CLEAR. This flag when specified at creation time will cause the event to execute a clear operation instead of a set when triggered. That way, the default mode is an edge-triggered set. The non-default mode is to trigger a clear. Level-triggered ints could therefore create two irqfds, one for raising, the other for clearing. That's my second choice option. An alternative is to abandon the use of eventfd, and allow the irqfd to be a first-class anon-fd. The parameters passed to the write/signal() function could then indicate the desired level. The disadvantage would be that it would not be compatible with eventfd, so we would need to decide if the tradeoff is worth it. I would really like to keep using eventfd. Which is why I asked Davide about the prospects of direct callbacks (vs wakeups). I saw that request. That would be ideal. OTOH, I suspect level triggered interrupts will be primarily in the legacy domain, so perhaps we do not need to worry about it too much. Therefore, another option is that we *could* simply set the stake in the ground that legacy/level cannot use irqfd. This is my preferred option. For a virtio-net-server in the kernel, we'd service its eventfd in qemu, raising and lowering the pci interrupt in the traditional way. But we'd still need to know when to lower the interrupt. How? IIUC, isn't that usually device/subsystem specific, and out of scope of the GSI delivery vehicle? For instance, most devices I have seen with level ints have a register in their device register namespace for acking the int. As an aside, this is what causes some of the grief in dealing with shared interrupts like KVM pass-through and/or threaded-isrs: There isn't a standardized way to ACK them. You may also see some generalization of masking/acking in things like the MSI-X table. But again, this would be out of scope of the general GSI delivery path IIUC. I understand that there is a feedback mechanism in the ioapic model for calling back on acknowledgment of the interrupt. But I am not sure what is how the real hardware works normally, and therefore I am not convinced that is something we need to feed all the way back (i.e. via irqfd or whatever). In the interest of full disclosure, its been a few years since I studied the xAPIC docs, so I might be out to lunch on that assertion. ;) -Greg signature.asc Description: OpenPGP digital signature
[PATCH 1/4] BIOS changes for configuring irq0-inti2 override (v2)
These patches resolve the irq0-inti2 override issue, and get the hpet working on kvm. They are dependent on Jes Sorensen's recent 0006-qemu-kvm-irq-routing.patch. Override and HPET changes are sent as a series because HPET depends on the override. Win2k8 expects the HPET interrupt on inti2, regardless of whether an override exists in the BIOS. And the HPET spec states that in legacy mode, timer interrupt is on inti2. The irq0-inti2 override will always be used unless the kernel cannot do irq routing (i.e., compatibility with old kernels). So if the kernel is capable, userspace sets up irq0-inti2 via the irq routing interface, and adds the irq0-inti2 override to the MADT interrupt source override table, and the mp table (for the no-acpi case). A couple of months ago, Marcelo was seeing RHEL5 guests complain of invalid checksum with these patches, but later he couldn't reproduce it, and I'm not seeing it now. While all guests still need to be fully tested, everything appears to be in order. I've tested on win2k864, win2k832, RHEL5.3 32 bit, and ubuntu 8.10 64 bit. Signed-off-by: Beth Kon e...@us.ibm.com diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c index 8684987..07dda73 100755 --- a/kvm/bios/rombios32.c +++ b/kvm/bios/rombios32.c @@ -445,6 +445,9 @@ uint32_t cpuid_ext_features; unsigned long ram_size; uint64_t ram_end; uint8_t bios_uuid[16]; +#ifdef BX_QEMU +uint8_t irq0_override; +#endif #ifdef BX_USE_EBDA_TABLES unsigned long ebda_cur_addr; #endif @@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val) #define QEMU_CFG_SIGNATURE 0x00 #define QEMU_CFG_ID 0x01 #define QEMU_CFG_UUID 0x02 +#define QEMU_CFG_IRQ0_OVERRIDE 0x0e int qemu_cfg_port; @@ -518,6 +522,18 @@ void uuid_probe(void) memset(bios_uuid, 0, 16); } +#ifdef BX_QEMU +void irq0_override_probe(void) +{ +if(qemu_cfg_port) { +qemu_cfg_select(QEMU_CFG_IRQ0_OVERRIDE); +qemu_cfg_read(irq0_override, 1); +return; +} +memset(irq0_override, 0, 1); +} +#endif + void cpu_probe(void) { uint32_t eax, ebx, ecx, edx; @@ -1160,6 +1176,13 @@ static void mptable_init(void) /* irqs */ for(i = 0; i 16; i++) { +#ifdef BX_QEMU +/* One entry per ioapic interrupt destination. Destination 2 is covered + * by irq0-inti2 override (i == 0). Source IRQ 2 is unused + */ +if (irq0_override i == 2) +continue; +#endif putb(q, 3); /* entry type = I/O interrupt */ putb(q, 0); /* interrupt type = vectored interrupt */ putb(q, 0); /* flags: po=0, el=0 */ @@ -1167,7 +1190,12 @@ static void mptable_init(void) putb(q, 0); /* source bus ID = ISA */ putb(q, i); /* source bus IRQ */ putb(q, ioapic_id); /* dest I/O APIC ID */ -putb(q, i); /* dest I/O APIC interrupt in */ +#ifdef BX_QEMU +if (irq0_override i == 0) +putb(q, 2); /* dest I/O APIC interrupt in */ +else +#endif +putb(q, i); /* dest I/O APIC interrupt in */ } /* patch length */ len = q - mp_config_table; @@ -1550,16 +1578,18 @@ void acpi_bios_init(void) addr = (addr + 7) ~7; madt_addr = addr; +madt = (void *)(addr); madt_size = sizeof(*madt) + sizeof(struct madt_processor_apic) * MAX_CPUS + -#ifdef BX_QEMU -sizeof(struct madt_io_apic) /* + sizeof(struct madt_int_override) */; -#else sizeof(struct madt_io_apic); +#ifdef BX_QEMU +for (i = 0; i 16; i++) +if (PCI_ISA_IRQ_MASK (1U i)) +madt_size += sizeof(struct madt_int_override); +if (irq0_override) +madt_size += sizeof(struct madt_int_override); #endif -madt = (void *)(addr); addr += madt_size; - #ifdef BX_QEMU #ifdef HPET_WORKS_IN_KVM addr = (addr + 7) ~7; @@ -1660,23 +1690,20 @@ void acpi_bios_init(void) io_apic-io_apic_id = smp_cpus; io_apic-address = cpu_to_le32(0xfec0); io_apic-interrupt = cpu_to_le32(0); +int_override = (struct madt_int_override*)(io_apic + 1); #ifdef BX_QEMU -#ifdef HPET_WORKS_IN_KVM -io_apic++; - -int_override = (void *)io_apic; -int_override-type = APIC_XRUPT_OVERRIDE; -int_override-length = sizeof(*int_override); -int_override-bus = cpu_to_le32(0); -int_override-source = cpu_to_le32(0); -int_override-gsi = cpu_to_le32(2); -int_override-flags = cpu_to_le32(0); -#endif +if (irq0_override) { +memset(int_override, 0, sizeof(*int_override)); +int_override-type = APIC_XRUPT_OVERRIDE; +int_override-length = sizeof(*int_override); +int_override-source = 0; +int_override-gsi = 2; +int_override-flags = 0; /* conforms to bus specifications */ +int_override++; +} #endif - -int_override = (struct madt_int_override*)(io_apic + 1); for ( i = 0; i 16; i++ ) {
[PATCH 3/4] BIOS changes for KVM HPET (v2)
Just a note here... The number of table_offset_entry entries for the non BX_QEMU case doesn't make sense here. There are only 2 entries. I left it as is, since it does not impact HPET's interraction with it. Actually it seems like dead code since this is in kvm code but with BX_QEMU undefined. It doesn't seem to be a problem. Signed-off-by: Beth Kon e...@us.ibm.com diff --git a/kvm/bios/acpi-dsdt.dsl b/kvm/bios/acpi-dsdt.dsl index c756fed..0e142be 100755 --- a/kvm/bios/acpi-dsdt.dsl +++ b/kvm/bios/acpi-dsdt.dsl @@ -308,7 +308,6 @@ DefinitionBlock ( }) } #ifdef BX_QEMU -#ifdef HPET_WORKS_IN_KVM Device(HPET) { Name(_HID, EISAID(PNP0103)) Name(_UID, 0) @@ -328,7 +327,6 @@ DefinitionBlock ( }) } #endif -#endif } Scope(\_SB.PCI0) { diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c index ddfa828..7441cd7 100755 --- a/kvm/bios/rombios32.c +++ b/kvm/bios/rombios32.c @@ -1293,7 +1293,7 @@ struct rsdt_descriptor_rev1 { ACPI_TABLE_HEADER_DEF /* ACPI common table header */ #ifdef BX_QEMU - uint32_t table_offset_entry [2]; /* Array of pointers to other */ + uint32_t table_offset_entry [3]; /* Array of pointers to other */ // uint32_t table_offset_entry [4]; /* Array of pointers to other */ #else uint32_t table_offset_entry [3]; /* Array of pointers to other */ @@ -1450,8 +1450,8 @@ struct acpi_20_generic_address { } __attribute__((__packed__)); /* - * * HPET Description Table - * */ + * HPET Description Table + */ struct acpi_20_hpet { ACPI_TABLE_HEADER_DEF /* ACPI common table header */ uint32_t timer_block_id; @@ -1591,13 +1591,11 @@ void acpi_bios_init(void) #endif addr += madt_size; #ifdef BX_QEMU -#ifdef HPET_WORKS_IN_KVM addr = (addr + 7) ~7; hpet_addr = addr; hpet = (void *)(addr); addr += sizeof(*hpet); #endif -#endif acpi_tables_size = addr - base_addr; @@ -1620,10 +1618,10 @@ void acpi_bios_init(void) memset(rsdt, 0, sizeof(*rsdt)); rsdt-table_offset_entry[0] = cpu_to_le32(fadt_addr); rsdt-table_offset_entry[1] = cpu_to_le32(madt_addr); -//rsdt-table_offset_entry[2] = cpu_to_le32(ssdt_addr); #ifdef BX_QEMU -//rsdt-table_offset_entry[3] = cpu_to_le32(hpet_addr); +rsdt-table_offset_entry[2] = cpu_to_le32(hpet_addr); #endif +//rsdt-table_offset_entry[3] = cpu_to_le32(ssdt_addr); acpi_build_table_header((struct acpi_table_header *)rsdt, RSDT, sizeof(*rsdt), 1); @@ -1723,7 +1721,6 @@ void acpi_bios_init(void) #ifdef BX_QEMU /* HPET */ -#ifdef HPET_WORKS_IN_KVM memset(hpet, 0, sizeof(*hpet)); /* Note timer_block_id value must be kept in sync with value advertised by * emulated hpet @@ -1733,7 +1730,6 @@ void acpi_bios_init(void) acpi_build_table_header((struct acpi_table_header *)hpet, HPET, sizeof(*hpet), 1); #endif -#endif } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] Userspace changes for KVM HPET (v2)
Signed-off-by: Beth Kon e...@us.ibm.com diff --git a/hw/hpet.c b/hw/hpet.c index c7945ec..47c9f89 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -30,6 +30,7 @@ #include console.h #include qemu-timer.h #include hpet_emul.h +#include qemu-kvm.h //#define HPET_DEBUG #ifdef HPET_DEBUG @@ -48,6 +49,43 @@ uint32_t hpet_in_legacy_mode(void) return 0; } +static void hpet_kpit_enable(void) +{ +struct kvm_pit_state ps; +kvm_get_pit(kvm_context, ps); +kvm_set_pit(kvm_context, ps); +} + +static void hpet_kpit_disable(void) +{ +struct kvm_pit_state ps; +kvm_get_pit(kvm_context, ps); +ps.channels[0].mode = 0xff; +kvm_set_pit(kvm_context, ps); +} + +static void hpet_legacy_enable(void) +{ +if (qemu_kvm_pit_in_kernel()) { + hpet_kpit_disable(); + dprintf(qemu: hpet disabled kernel pit\n); +} else { + hpet_pit_disable(); + dprintf(qemu: hpet disabled userspace pit\n); +} +} + +static void hpet_legacy_disable(void) +{ +if (qemu_kvm_pit_in_kernel()) { + hpet_kpit_enable(); + dprintf(qemu: hpet enabled kernel pit\n); +} else { + hpet_pit_enable(); + dprintf(qemu: hpet enabled userspace pit\n); +} +} + static uint32_t timer_int_route(struct HPETTimer *timer) { uint32_t route; @@ -475,9 +513,9 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr, } /* i8254 and RTC are disabled when HPET is in legacy mode */ if (activating_bit(old_val, new_val, HPET_CFG_LEGACY)) { -hpet_pit_disable(); +hpet_legacy_enable(); } else if (deactivating_bit(old_val, new_val, HPET_CFG_LEGACY)) { -hpet_pit_enable(); +hpet_legacy_disable(); } break; case HPET_CFG + 4: @@ -560,7 +598,7 @@ static void hpet_reset(void *opaque) { * hpet_reset is called due to system reset. At this point control must * be returned to pit until SW reenables hpet. */ -hpet_pit_enable(); +hpet_legacy_disable(); count = 1; } diff --git a/vl.c b/vl.c index f9a72b3..b860b82 100644 --- a/vl.c +++ b/vl.c @@ -6138,10 +6138,15 @@ int main(int argc, char **argv, char **envp) } if (kvm_enabled()) { - kvm_init_ap(); +kvm_init_ap(); #ifdef USE_KVM if (kvm_irqchip !qemu_kvm_has_gsi_routing()) { irq0override = 0; +/* if kernel can't do irq routing, interrupt source + * override 0-2 can not be set up as required by hpet, + * so disable hpet. + */ +no_hpet=1; } #endif } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][KVM][retry 1] Add support for Pause Filtering to AMD SVM
commit 01813db8627e74018c8cec90df7e345839351f23 Author: root r...@xendinar01.amd.com Date: Thu May 7 09:44:10 2009 -0500 New AMD processors will support the Pause Filter Feature. This feature creates a new field in the VMCB called Pause Filter Count. If Pause Filter Count is greater than 0 and intercepting PAUSEs is enabled, the processor will increment an internal counter when a PAUSE instruction occurs instead of intercepting. When the internal counter reaches the Pause Filter Count value, a PAUSE intercept will occur. This feature can be used to detect contended spinlocks, especially when the lock holding VCPU is not scheduled. Rescheduling another VCPU prevents the VCPU seeking the lock from wasting its quantum by spinning idly. Experimental results show that most spinlocks are held for less than 1000 PAUSE cycles or more than a few thousand. Default the Pause Filter Counter to 3000 to detect the contended spinlocks. Processor support for this feature is indicated by a CPUID bit. On a 24 core system running 4 guests each with 16 VCPUs, this patch improved overall performance of each guest's 32 job kernbench by approximately 1%. Further performance improvement may be possible with a more sophisticated yield algorithm. -Mark Langsdorf Operating System Research Center AMD Signed-off-by: Mark Langsdorf mark.langsd...@amd.com diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index 85574b7..1fecb7e 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -57,7 +57,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area { u16 intercept_dr_write; u32 intercept_exceptions; u64 intercept; - u8 reserved_1[44]; + u8 reserved_1[42]; + u16 pause_filter_count; u64 iopm_base_pa; u64 msrpm_base_pa; u64 tsc_offset; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ef43a18..4279141 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -45,6 +45,7 @@ MODULE_LICENSE(GPL); #define SVM_FEATURE_NPT (1 0) #define SVM_FEATURE_LBRV (1 1) #define SVM_FEATURE_SVML (1 2) +#define SVM_FEATURE_PAUSE_FILTER (1 10) #define DEBUGCTL_RESERVED_BITS (~(0x3fULL)) @@ -575,6 +576,12 @@ static void init_vmcb(struct vcpu_svm *svm) svm-nested_vmcb = 0; svm-vcpu.arch.hflags = HF_GIF_MASK; + + if (svm_has(SVM_FEATURE_PAUSE_FILTER)) { + control-pause_filter_count = 5000; + control-intercept |= (1ULL INTERCEPT_PAUSE); + } + } static int svm_vcpu_reset(struct kvm_vcpu *vcpu) @@ -2087,6 +2094,15 @@ static int interrupt_window_interception(struct vcpu_svm *svm, return 1; } +static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) +{ + /* Simple yield */ + vcpu_put(svm-vcpu); + schedule(); + vcpu_load(svm-vcpu); + return 1; +} + static int (*svm_exit_handlers[])(struct vcpu_svm *svm, struct kvm_run *kvm_run) = { [SVM_EXIT_READ_CR0] = emulate_on_interception, @@ -2123,6 +2139,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm, [SVM_EXIT_CPUID]= cpuid_interception, [SVM_EXIT_IRET] = iret_interception, [SVM_EXIT_INVD] = emulate_on_interception, + [SVM_EXIT_PAUSE]= pause_interception, [SVM_EXIT_HLT] = halt_interception, [SVM_EXIT_INVLPG] = invlpg_interception, [SVM_EXIT_INVLPGA] = invalid_op_interception, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2b73e19..e2b730d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -710,6 +710,7 @@ void vcpu_load(struct kvm_vcpu *vcpu) kvm_arch_vcpu_load(vcpu, cpu); put_cpu(); } +EXPORT_SYMBOL_GPL(vcpu_load); void vcpu_put(struct kvm_vcpu *vcpu) { @@ -719,6 +720,7 @@ void vcpu_put(struct kvm_vcpu *vcpu) preempt_enable(); mutex_unlock(vcpu-mutex); } +EXPORT_SYMBOL_GPL(vcpu_put); static void ack_flush(void *_completed) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] Userspace changes for configuring irq0-inti2 override (v2)
Signed-off-by: Beth Kon e...@us.ibm.com diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c index e1b19d7..bb74f38 100644 --- a/hw/fw_cfg.c +++ b/hw/fw_cfg.c @@ -279,6 +279,7 @@ void *fw_cfg_init(uint32_t ctl_port, uint32_t data_port, fw_cfg_add_bytes(s, FW_CFG_UUID, qemu_uuid, 16); fw_cfg_add_i16(s, FW_CFG_NOGRAPHIC, (uint16_t)nographic); fw_cfg_add_i16(s, FW_CFG_NB_CPUS, (uint16_t)smp_cpus); +fw_cfg_add_i16(s, FW_CFG_IRQ0_OVERRIDE, (uint16_t)irq0override); register_savevm(fw_cfg, -1, 1, fw_cfg_save, fw_cfg_load, s); qemu_register_reset(fw_cfg_reset, s); diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h index f616ed2..498c1e3 100644 --- a/hw/fw_cfg.h +++ b/hw/fw_cfg.h @@ -15,6 +15,7 @@ #define FW_CFG_INITRD_SIZE 0x0b #define FW_CFG_BOOT_DEVICE 0x0c #define FW_CFG_NUMA 0x0d +#define FW_CFG_IRQ0_OVERRIDE0x0e #define FW_CFG_MAX_ENTRY0x10 #define FW_CFG_WRITE_CHANNEL0x4000 diff --git a/hw/ioapic.c b/hw/ioapic.c index 0b70cf6..2d77a2c 100644 --- a/hw/ioapic.c +++ b/hw/ioapic.c @@ -23,6 +23,7 @@ #include hw.h #include pc.h +#include sysemu.h #include qemu-timer.h #include host-utils.h @@ -95,14 +96,13 @@ void ioapic_set_irq(void *opaque, int vector, int level) { IOAPICState *s = opaque; -#if 0 /* ISA IRQs map to GSI 1-1 except for IRQ0 which maps * to GSI 2. GSI maps to ioapic 1-1. This is not * the cleanest way of doing it but it should work. */ -if (vector == 0) +if (vector == 0 irq0override) { vector = 2; -#endif +} if (vector = 0 vector IOAPIC_NUM_PINS) { uint32_t mask = 1 vector; diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 8cb6faa..2e52c87 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -879,7 +879,11 @@ int kvm_arch_init_irq_routing(void) return r; } for (i = 0; i 24; ++i) { -r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, i); +if (i == 0) { +r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, 2); +} else if (i != 2) { +r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, i); +} if (r 0) return r; } diff --git a/qemu-kvm.h b/qemu-kvm.h index 8226001..c64718c 100644 --- a/qemu-kvm.h +++ b/qemu-kvm.h @@ -165,6 +165,7 @@ void qemu_kvm_cpu_stop(CPUState *env); #define kvm_enabled() (kvm_allowed) #define qemu_kvm_irqchip_in_kernel() kvm_irqchip_in_kernel(kvm_context) #define qemu_kvm_pit_in_kernel() kvm_pit_in_kernel(kvm_context) +#define qemu_kvm_has_gsi_routing() kvm_has_gsi_routing(kvm_context) #define kvm_has_sync_mmu() qemu_kvm_has_sync_mmu() void kvm_init_vcpu(CPUState *env); void kvm_load_tsc(CPUState *env); @@ -173,6 +174,7 @@ void kvm_load_tsc(CPUState *env); #define kvm_nested 0 #define qemu_kvm_irqchip_in_kernel() (0) #define qemu_kvm_pit_in_kernel() (0) +#define qemu_kvm_has_gsi_routing() (0) #define kvm_has_sync_mmu() (0) #define kvm_load_registers(env) do {} while(0) #define kvm_save_registers(env) do {} while(0) diff --git a/sysemu.h b/sysemu.h index 1f45fd6..292bbc3 100644 --- a/sysemu.h +++ b/sysemu.h @@ -93,6 +93,7 @@ extern int graphic_width; extern int graphic_height; extern int graphic_depth; extern int nographic; +extern int irq0override; extern const char *keyboard_layout; extern int win2k_install_hack; extern int rtc_td_hack; diff --git a/vl.c b/vl.c index 6b4b7d2..f9a72b3 100644 --- a/vl.c +++ b/vl.c @@ -207,6 +207,7 @@ static int vga_ram_size; enum vga_retrace_method vga_retrace_method = VGA_RETRACE_DUMB; static DisplayState *display_state; int nographic; +int irq0override; static int curses; static int sdl; const char* keyboard_layout = NULL; @@ -5039,6 +5040,7 @@ int main(int argc, char **argv, char **envp) vga_ram_size = VGA_RAM_SIZE; snapshot = 0; nographic = 0; +irq0override = 1; curses = 0; kernel_filename = NULL; kernel_cmdline = ; @@ -6135,8 +6137,14 @@ int main(int argc, char **argv, char **envp) } } -if (kvm_enabled()) - kvm_init_ap(); +if (kvm_enabled()) { + kvm_init_ap(); +#ifdef USE_KVM +if (kvm_irqchip !qemu_kvm_has_gsi_routing()) { +irq0override = 0; +} +#endif +} machine-init(ram_size, vga_ram_size, boot_devices, kernel_filename, kernel_cmdline, initrd_filename, cpu_model); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Gregory Haskins wrote: This is my preferred option. For a virtio-net-server in the kernel, we'd service its eventfd in qemu, raising and lowering the pci interrupt in the traditional way. But we'd still need to know when to lower the interrupt. How? IIUC, isn't that usually device/subsystem specific, and out of scope of the GSI delivery vehicle? For instance, most devices I have seen with level ints have a register in their device register namespace for acking the int. Yes it is. As an aside, this is what causes some of the grief in dealing with shared interrupts like KVM pass-through and/or threaded-isrs: There isn't a standardized way to ACK them. So we'd need a side channel to tell userspace to lower the irq. Another eventfd likely. Note we don't support device assignment for devices with shared interrupts. You may also see some generalization of masking/acking in things like the MSI-X table. But again, this would be out of scope of the general GSI delivery path IIUC. I understand that there is a feedback mechanism in the ioapic model for calling back on acknowledgment of the interrupt. But I am not sure what is how the real hardware works normally, and therefore I am not convinced that is something we need to feed all the way back (i.e. via irqfd or whatever). In the interest of full disclosure, its been a few years since I studied the xAPIC docs, so I might be out to lunch on that assertion. ;) Right, that ack thing is completely internal, used for catching up on time drift, and for shutting down level triggered assigned interrupts. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][KVM][retry 1] Add support for Pause Filtering to AMD SVM
(copying Ingo) Mark Langsdorf wrote: commit 01813db8627e74018c8cec90df7e345839351f23 Author: root r...@xendinar01.amd.com Date: Thu May 7 09:44:10 2009 -0500 New AMD processors will support the Pause Filter Feature. This feature creates a new field in the VMCB called Pause Filter Count. If Pause Filter Count is greater than 0 and intercepting PAUSEs is enabled, the processor will increment an internal counter when a PAUSE instruction occurs instead of intercepting. When the internal counter reaches the Pause Filter Count value, a PAUSE intercept will occur. This feature can be used to detect contended spinlocks, especially when the lock holding VCPU is not scheduled. Rescheduling another VCPU prevents the VCPU seeking the lock from wasting its quantum by spinning idly. Experimental results show that most spinlocks are held for less than 1000 PAUSE cycles or more than a few thousand. Default the Pause Filter Counter to 3000 to detect the contended spinlocks. Processor support for this feature is indicated by a CPUID bit. On a 24 core system running 4 guests each with 16 VCPUs, this patch improved overall performance of each guest's 32 job kernbench by approximately 1%. Further performance improvement may be possible with a more sophisticated yield algorithm. -Mark Langsdorf Operating System Research Center AMD Signed-off-by: Mark Langsdorf mark.langsd...@amd.com (please use git format-patch rather than git show, and set up user.name and user.email properly) svm-nested_vmcb = 0; svm-vcpu.arch.hflags = HF_GIF_MASK; + + if (svm_has(SVM_FEATURE_PAUSE_FILTER)) { + control-pause_filter_count = 5000; + control-intercept |= (1ULL INTERCEPT_PAUSE); + } + } 3000 or 5000? +static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) +{ + /* Simple yield */ + vcpu_put(svm-vcpu); + schedule(); + vcpu_load(svm-vcpu); + return 1; + Ingo, will this do anything under CFS, or will CFS note that nothing has changed in the accounting and reschedule us immediately? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] BIOS changes for configuring irq0-inti2 override (v2)
Beth Kon wrote: These patches resolve the irq0-inti2 override issue, and get the hpet working on kvm. They are dependent on Jes Sorensen's recent 0006-qemu-kvm-irq-routing.patch. Override and HPET changes are sent as a series because HPET depends on the override. Win2k8 expects the HPET interrupt on inti2, regardless of whether an override exists in the BIOS. And the HPET spec states that in legacy mode, timer interrupt is on inti2. The irq0-inti2 override will always be used unless the kernel cannot do irq routing (i.e., compatibility with old kernels). So if the kernel is capable, userspace sets up irq0-inti2 via the irq routing interface, and adds the irq0-inti2 override to the MADT interrupt source override table, and the mp table (for the no-acpi case). A couple of months ago, Marcelo was seeing RHEL5 guests complain of invalid checksum with these patches, but later he couldn't reproduce it, and I'm not seeing it now. While all guests still need to be fully tested, everything appears to be in order. I've tested on win2k864, win2k832, RHEL5.3 32 bit, and ubuntu 8.10 64 bit. What are the changes relative to v1? @@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val) #define QEMU_CFG_SIGNATURE 0x00 #define QEMU_CFG_ID 0x01 #define QEMU_CFG_UUID 0x02 +#define QEMU_CFG_IRQ0_OVERRIDE 0x0e As noted, this should be in the arch local space. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Shared memory device with interrupt support
Support an inter-vm shared memory device that maps a shared-memory object as a PCI device in the guest. This patch also supports interrupts between guest by communicating over a unix domain socket. This patch applies to the qemu-kvm repository. This device now creates a qemu character device and sends 1-bytes messages to trigger interrupts. Writes are trigger by writing to the Doorbell register on the shared memory PCI device. The lower 8-bits of the value written to this register are sent as the 1-byte message so different meanings of interrupts can be supported. Interrupts are only supported between 2 VMs currently. One VM must act as the server by adding server to the command-line argument. Shared memory devices are created with the following command-line: -ivhshmem shm object,size in MB,[unix:path][,server] Interrupts can also be used between host and guest as well by implementing a listener on the host. Cam --- Makefile.target |3 + hw/ivshmem.c| 421 +++ hw/pc.c |6 + hw/pc.h |3 + qemu-options.hx | 14 ++ sysemu.h|8 + vl.c| 14 ++ 7 files changed, 469 insertions(+), 0 deletions(-) create mode 100644 hw/ivshmem.c diff --git a/Makefile.target b/Makefile.target index b68a689..3190bba 100644 --- a/Makefile.target +++ b/Makefile.target @@ -643,6 +643,9 @@ OBJS += pcnet.o OBJS += rtl8139.o OBJS += e1000.o +# Inter-VM PCI shared memory +OBJS += ivshmem.o + # Generic watchdog support and some watchdog devices OBJS += watchdog.o OBJS += wdt_ib700.o wdt_i6300esb.o diff --git a/hw/ivshmem.c b/hw/ivshmem.c new file mode 100644 index 000..95e2268 --- /dev/null +++ b/hw/ivshmem.c @@ -0,0 +1,421 @@ +/* + * Inter-VM Shared Memory PCI device. + * + * Author: + * Cam Macdonell c...@cs.ualberta.ca + * + * Based On: cirrus_vga.c and rtl8139.c + * + * This code is licensed under the GNU GPL v2. + */ + +#include hw.h +#include console.h +#include pc.h +#include pci.h +#include sysemu.h + +#include qemu-common.h +#include sys/mman.h + +#define PCI_COMMAND_IOACCESS0x0001 +#define PCI_COMMAND_MEMACCESS 0x0002 +#define PCI_COMMAND_BUSMASTER 0x0004 + +//#define DEBUG_IVSHMEM + +#ifdef DEBUG_IVSHMEM +#define IVSHMEM_DPRINTF(fmt, args...)\ +do {printf(IVSHMEM: fmt, ##args); } while (0) +#else +#define IVSHMEM_DPRINTF(fmt, args...) +#endif + +typedef struct IVShmemState { +uint16_t intrmask; +uint16_t intrstatus; +uint16_t doorbell; +uint8_t *ivshmem_ptr; +unsigned long ivshmem_offset; +unsigned int ivshmem_size; +unsigned long bios_offset; +unsigned int bios_size; +target_phys_addr_t base_ctrl; +int it_shift; +PCIDevice *pci_dev; +CharDriverState * chr; +unsigned long map_addr; +unsigned long map_end; +int ivshmem_mmio_io_addr; +} IVShmemState; + +typedef struct PCI_IVShmemState { +PCIDevice dev; +IVShmemState ivshmem_state; +} PCI_IVShmemState; + +typedef struct IVShmemDesc { +char name[1024]; +char * chrdev; +int size; +} IVShmemDesc; + + +/* registers for the Inter-VM shared memory device */ +enum ivshmem_registers { +IntrMask = 0, +IntrStatus = 16, +Doorbell = 32 +}; + +static int num_ivshmem_devices = 0; +static IVShmemDesc ivshmem_desc; + +static void ivshmem_map(PCIDevice *pci_dev, int region_num, +uint32_t addr, uint32_t size, int type) +{ +PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev; +IVShmemState *s = d-ivshmem_state; + +IVSHMEM_DPRINTF(addr = %u size = %u\n, addr, size); +cpu_register_physical_memory(addr, s-ivshmem_size, s-ivshmem_offset); + +} + +void ivshmem_init(const char * optarg) { + +char * temp; +char * ivshmem_sz; +int size; + +num_ivshmem_devices++; + +/* currently we only support 1 device */ +if (num_ivshmem_devices MAX_IVSHMEM_DEVICES) { +return; +} + +temp = strdup(optarg); +snprintf(ivshmem_desc.name, 1024, /%s, strsep(temp,,)); +ivshmem_sz=strsep(temp,,); +if (ivshmem_sz != NULL){ +size = atol(ivshmem_sz); +} else { +size = -1; +} + +ivshmem_desc.chrdev = strsep(temp,\0); + +if ( size == -1) { +ivshmem_desc.size = TARGET_PAGE_SIZE; +} else { +ivshmem_desc.size = size*1024*1024; +} +IVSHMEM_DPRINTF(optarg is %s, name is %s, size is %d, chrdev is %s\n, +optarg, ivshmem_desc.name, +ivshmem_desc.size, ivshmem_desc.chrdev); +} + +int ivshmem_get_size(void) { +return ivshmem_desc.size; +} + +/* accessing registers - based on rtl8139 */ +static void ivshmem_update_irq(IVShmemState *s) +{ +int isr; +isr = (s-intrstatus s-intrmask) 0x; + +/* don't print ISR resets */ +if (isr) { +IVSHMEM_DPRINTF(Set IRQ to %d (%04x %04x)\n, + isr
bridges
I'm trying to understand bridging with KVM, but am still puzzled. I think that the recommended bridging with TAP means that packets from the VM will end up going out the host card attached to the default gateway. But it looks to me as if their IP address is unchanged, which means replies will never reach me. Is that correct? Do I need to NAT the packets, or is something already doing that? Some documents indicate that I need to bring the interfaces (e.g., eth0) down before I bring the bridge up, and that afterwards only the bridge will have an IP address. Is that right? Some documents, e.g., http://ebtables.sourceforge.net/br_fw_ia/br_fw_ia.html, indicate iptables should just work with bridging. However, I've seen someone with a 2.6.15 kernel ask about firewalling and be told they needed to patch the kernel to get it work (don't have the reference handy). Should it just work? I'm running a 2.6.29 kernel on Debian Lenny with kvm 72+dfsg-5~lenny1. Version 84+dfsg-2 is available in experimental. Is there much to be gained by going with the more recent version? Please cc me; I'm not on the list. Thanks. Ross Boylan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] bios: Use a different mask to size the option ROM BAR
Bit 0 is the enable bit, which we not only don't want to set, but it will stick and make us think it's an I/O port resource. Signed-off-by: Alex Williamson alex.william...@hp.com --- bios/rombios32.c |8 +--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/bios/rombios32.c b/bios/rombios32.c index 269f175..616e70a 100644 --- a/bios/rombios32.c +++ b/bios/rombios32.c @@ -948,11 +948,13 @@ static void pci_bios_init_device(PCIDevice *d) int ofs; uint32_t val, size ; -if (i == PCI_ROM_SLOT) +if (i == PCI_ROM_SLOT) { ofs = 0x30; -else +pci_config_writel(d, ofs, 0xfffe); +} else { ofs = 0x10 + i * 4; -pci_config_writel(d, ofs, 0x); +pci_config_writel(d, ofs, 0x); +} val = pci_config_readl(d, ofs); if (val != 0) { size = (~(val ~0xf)) + 1; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Driver for Inter-VM shared memory device for KVM supporting interrupts.
Driver for inter-VM shared memory device that now supports interrupts between two guests. The driver defines a counting semaphore and wait_event queue for different synchronization needs of users. Initializing the semaphore count, sending interrupts and waiting are implemented via ioctl calls. The synchronization mechanisms are simple and rely on existing kernel primitives, but I think they're flexible for synchronization between guests. I'm contemplating more complicated designs that would use the shared memory to store synchronization variables, but thought I would get this initial patch out to get some feedback. Cheers, Cam --- drivers/char/Kconfig |8 + drivers/char/Makefile |2 + drivers/char/kvm_ivshmem.c | 430 3 files changed, 440 insertions(+), 0 deletions(-) create mode 100644 drivers/char/kvm_ivshmem.c diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 735bbe2..afa7cb8 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -1099,6 +1099,14 @@ config DEVPORT depends on ISA || PCI default y +config KVM_IVSHMEM +tristate Inter-VM Shared Memory Device +depends on PCI +default m +help + This device maps a region of shared memory between the host OS and any + number of virtual machines. + source drivers/s390/char/Kconfig endmenu diff --git a/drivers/char/Makefile b/drivers/char/Makefile index 9caf5b5..021f06b 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -111,6 +111,8 @@ obj-$(CONFIG_PS3_FLASH) += ps3flash.o obj-$(CONFIG_JS_RTC) += js-rtc.o js-rtc-y = rtc.o +obj-$(CONFIG_KVM_IVSHMEM) += kvm_ivshmem.o + # Files generated that shall be removed upon make clean clean-files := consolemap_deftbl.c defkeymap.c diff --git a/drivers/char/kvm_ivshmem.c b/drivers/char/kvm_ivshmem.c new file mode 100644 index 000..a20a224 --- /dev/null +++ b/drivers/char/kvm_ivshmem.c @@ -0,0 +1,430 @@ +/* + * drivers/char/kvm_ivshmem.c - driver for KVM Inter-VM shared memory PCI device + * + * Copyright 2009 Cam Macdonell c...@cs.ualberta.ca + * + * Based on cirrusfb.c and 8139cp.c: + * Copyright 1999-2001 Jeff Garzik + * Copyright 2001-2004 Jeff Garzik + * + */ + +#include linux/init.h +#include linux/kernel.h +#include linux/module.h +#include linux/pci.h +#include linux/proc_fs.h +#include linux/smp_lock.h +#include asm/uaccess.h +#include linux/interrupt.h +#include linux/mutex.h + +#define TRUE 1 +#define FALSE 0 +#define KVM_IVSHMEM_DEVICE_MINOR_NUM 0 + +enum { +/* KVM Inter-VM shared memory device register offsets */ +IntrMask= 0x00,/* Interrupt Mask */ +IntrStatus = 0x10,/* Interrupt Status */ +Doorbell= 0x20,/* Doorbell */ +ShmOK = 1/* Everything is OK */ +}; + +typedef struct kvm_ivshmem_device { +void __iomem * regs; + +void * base_addr; + +unsigned int regaddr; +unsigned int reg_size; + +unsigned int ioaddr; +unsigned int ioaddr_size; +unsigned int irq; + +bool enabled; + +} kvm_ivshmem_device; + +static int event_num; +static struct semaphore sema; +static wait_queue_head_t wait_queue; + +static kvm_ivshmem_device kvm_ivshmem_dev; + +static int device_major_nr; + +static int kvm_ivshmem_ioctl(struct inode *, struct file *, unsigned int, unsigned long); +static int kvm_ivshmem_mmap(struct file *, struct vm_area_struct *); +static int kvm_ivshmem_open(struct inode *, struct file *); +static int kvm_ivshmem_release(struct inode *, struct file *); +static ssize_t kvm_ivshmem_read(struct file *, char *, size_t, loff_t *); +static ssize_t kvm_ivshmem_write(struct file *, const char *, size_t, loff_t *); +static loff_t kvm_ivshmem_lseek(struct file * filp, loff_t offset, int origin); + +enum ivshmem_ioctl { set_sema, down_sema, sema_irq, wait_event, wait_event_irq }; + +static const struct file_operations kvm_ivshmem_ops = { +.owner = THIS_MODULE, +.open= kvm_ivshmem_open, +.mmap= kvm_ivshmem_mmap, +.read= kvm_ivshmem_read, +.ioctl = kvm_ivshmem_ioctl, +.write = kvm_ivshmem_write, +.llseek = kvm_ivshmem_lseek, +.release = kvm_ivshmem_release, +}; + +static struct pci_device_id kvm_ivshmem_id_table[] = { +{ 0x1af4, 0x1110, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 }, +{ 0 }, +}; +MODULE_DEVICE_TABLE (pci, kvm_ivshmem_id_table); + +static void kvm_ivshmem_remove_device(struct pci_dev* pdev); +static int kvm_ivshmem_probe_device (struct pci_dev *pdev, +const struct pci_device_id * ent); + +static struct pci_driver kvm_ivshmem_pci_driver = { +.name= kvm-shmem, +.id_table= kvm_ivshmem_id_table, +.probe = kvm_ivshmem_probe_device, +.remove = kvm_ivshmem_remove_device, +}; + +static int kvm_ivshmem_ioctl(struct inode * ino, struct file * filp, +unsigned int cmd,
Re: [PATCH 1/4] BIOS changes for configuring irq0-inti2 override(v2)
Avi Kivity wrote: Beth Kon wrote: These patches resolve the irq0-inti2 override issue, and get the hpet working on kvm. They are dependent on Jes Sorensen's recent 0006-qemu-kvm-irq-routing.patch. Override and HPET changes are sent as a series because HPET depends on the override. Win2k8 expects the HPET interrupt on inti2, regardless of whether an override exists in the BIOS. And the HPET spec states that in legacy mode, timer interrupt is on inti2. The irq0-inti2 override will always be used unless the kernel cannot do irq routing (i.e., compatibility with old kernels). So if the kernel is capable, userspace sets up irq0-inti2 via the irq routing interface, and adds the irq0-inti2 override to the MADT interrupt source override table, and the mp table (for the no-acpi case). A couple of months ago, Marcelo was seeing RHEL5 guests complain of invalid checksum with these patches, but later he couldn't reproduce it, and I'm not seeing it now. While all guests still need to be fully tested, everything appears to be in order. I've tested on win2k864, win2k832, RHEL5.3 32 bit, and ubuntu 8.10 64 bit. What are the changes relative to v1? Just merge issues with the changes you put in when moving to the newer bios. I submitted prematurely, incorrectly thinking I was done testing. When I finished, some problems surfaced. @@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val) #define QEMU_CFG_SIGNATURE 0x00 #define QEMU_CFG_ID 0x01 #define QEMU_CFG_UUID 0x02 +#define QEMU_CFG_IRQ0_OVERRIDE 0x0e As noted, this should be in the arch local space. The base changes were not in the code yet. As we discussed on IRC, I'll resubmit once they're there. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
On Thu, 7 May 2009, Avi Kivity wrote: Davide Libenzi wrote: What's your take on adding irq context safe callbacks to irqfd? To give some background here, we would like to use eventfd as a generic connector between components, so the components do not know about each other. So far eventfd successfully abstracts among components in the same process, in different processes, and in the kernel. eventfd_signal() can be safely called from irq context, and will wake up a waiting task. But in some cases, if the consumer is in the kernel, it may be able to consume the event from irq context, saving a context switch. So, will you consider patches adding this capability to eventfd? Maybe I got lost in the thread, but inside the kernel we have callback-based wakeup since long time. This is what epoll uses, when hooking into the file* f_op-poll() subsystem. Did you mean something else? Do you mean wait_queue_t::func? Yes, it is since long time ago that a wakeup in Linux can lead either to a real wakeup (in scheduler terms), or to a simple function call. Isn't that enough? - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: * Gregory Haskins (ghask...@novell.com) wrote: Chris Wright wrote: VF drivers can also have this issue (and typically use mmio). I at least have a better idea what your proposal is, thanks for explanation. Are you able to demonstrate concrete benefit with it yet (improved latency numbers for example)? I had a test-harness/numbers for this kind of thing, but its a bit crufty since its from ~1.5 years ago. I will dig it up, update it, and generate/post new numbers. That would be useful, because I keep coming back to pio and shared page(s) when think of why not to do this. Seems I'm not alone in that. thanks, -chris I completed the resurrection of the test and wrote up a little wiki on the subject, which you can find here: http://developer.novell.com/wiki/index.php/WhyHypercalls Hopefully this answers Chris' show me the numbers and Anthony's Why reinvent the wheel? questions. I will include this information when I publish the updated v2 series with the s/hypercall/dynhc changes. Let me know if you have any questions. -Greg signature.asc Description: OpenPGP digital signature
[PATCH] kvm: device-assignment: Catch GSI overflow
Fix the index at which we return -ENOSPC since the kernel side will reject a GSI = KVM_MAX_IRQ_ROUTES. Also, mask as a signed int before testing for error. Signed-off-by: Alex Williamson alex.william...@hp.com --- hw/device-assignment.c |2 +- kvm/libkvm/libkvm.c|2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index a7365c8..e06dd08 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -796,7 +796,7 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, unsigned int ctrl_pos) pci_dev-cap.start + PCI_MSI_DATA_32); assigned_dev-entry-type = KVM_IRQ_ROUTING_MSI; assigned_dev-entry-gsi = kvm_get_irq_route_gsi(kvm_context); -if (assigned_dev-entry-gsi 0) { +if ((int)(assigned_dev-entry-gsi) 0) { perror(assigned_dev_update_msi: kvm_get_irq_route_gsi); return; } diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c index ba0a5d1..2a4165a 100644 --- a/kvm/libkvm/libkvm.c +++ b/kvm/libkvm/libkvm.c @@ -1408,7 +1408,7 @@ int kvm_get_irq_route_gsi(kvm_context_t kvm) { #ifdef KVM_CAP_IRQ_ROUTING if (kvm-max_used_gsi = KVM_IOAPIC_NUM_PINS) { - if (kvm-max_used_gsi = kvm_get_gsi_count(kvm)) + if (kvm-max_used_gsi + 1 kvm_get_gsi_count(kvm)) return kvm-max_used_gsi + 1; else return -ENOSPC; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: device-assignment: Catch GSI overflow
On Thu, 2009-05-07 at 11:09 -0600, Alex Williamson wrote: Fix the index at which we return -ENOSPC since the kernel side will reject a GSI = KVM_MAX_IRQ_ROUTES. Also, mask as a signed int before testing for error. Even with this, there still seems to be a fundamental problem with our consumption of GSIs in kvm. For example, every time a guest writes to the MSI capabilities area and enables MSI support, we call kvm_get_irq_route_gsi() to get a new max_used_gsi + 1 value, then call kvm_add_routing_entry(), which updates max_used_gsi. It doesn't take too long before we exhaust the GSI space and the device no longer works. This seems to happen within a minute or two of booting a guest with an e1000e device sitting idle on a busy network. Do we need to keep a bitmap of used GSIs or maybe just attempt to reuse the GSI we've gotten previously for the device? Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: bridges
On Thu, 2009-05-07 at 11:13 -0600, Cam Macdonell wrote: Ross Boylan wrote: I'm trying to understand bridging with KVM, but am still puzzled. I think that the recommended bridging with TAP means that packets from the VM will end up going out the host card attached to the default gateway. But it looks to me as if their IP address is unchanged, which means replies will never reach me. Is that correct? Do I need to NAT the packets, or is something already doing that? Hi Ross, This is the place to start http://www.linux-kvm.org/page/Networking. I saw that; it gives some recipes but I wasn't sure what their effect was. You want a public bridge. I'm not sure what their and me mean in your email. In short, with bridging each VM has its own IP and that VM can be accessed directly from the network. their = the VM. me = my host machine. So if the VM's are running on their own subnet, e.g., 10.0.2.* (I've been assuming the subnet with TAP is like the one with the User Mode Network stack in 3.7.3 of http://www.nongnu.org/qemu/qemu-doc.html) and my host machine is on another net, e.g., 10.0.8.* then I think the packet will go out with an IP of 10.0.2.2 (say). When some other machine tries to reply to 10.0.2.2, the packet gets lost because the outside network thinks 10.0.2.* is not for it. At least that's my concern. If the return IP address on the packet were 10.0.8.44 (supposing that's the IP of my host machine) then the packets could find their way back. My host machine is on an internal network with a 10.* IP. The example might be clearer if one supposed that the VM's were on a 192.168.* network. I am perhaps being influenced by the fact that I don't want to ask for more IP's, so I don't want to configure the VM's to use an IP on our 10.0.8 network. Does the TAP networking setup a whole subnet like the user mode network stack (e.g., running a DHCP server), or is the idea that I would just give the VM an IP on my subnet (10.0.8.*) in this example? If the latter is the case (I'm now suspecting it is) then I think the solution is clear. I just stick the VM's on a private (to my machine) subnet, like 192.168.*, and I do NAT on the packets as they go out. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: I completed the resurrection of the test and wrote up a little wiki on the subject, which you can find here: http://developer.novell.com/wiki/index.php/WhyHypercalls Hopefully this answers Chris' show me the numbers and Anthony's Why reinvent the wheel? questions. I will include this information when I publish the updated v2 series with the s/hypercall/dynhc changes. Let me know if you have any questions. Well, 420 ns is not to be sneezed at. What do you think of my mmio hypercall? That will speed up all mmio to be as fast as a hypercall, and then we can use ordinary mmio/pio writes to trigger things. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: Gregory Haskins wrote: I completed the resurrection of the test and wrote up a little wiki on the subject, which you can find here: http://developer.novell.com/wiki/index.php/WhyHypercalls Hopefully this answers Chris' show me the numbers and Anthony's Why reinvent the wheel? questions. I will include this information when I publish the updated v2 series with the s/hypercall/dynhc changes. Let me know if you have any questions. Well, 420 ns is not to be sneezed at. What do you think of my mmio hypercall? That will speed up all mmio to be as fast as a hypercall, and then we can use ordinary mmio/pio writes to trigger things. I like it! Bigger question is what kind of work goes into making mmio a pv_op (or is this already done)? -Greg signature.asc Description: OpenPGP digital signature
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Davide Libenzi wrote: On Thu, 7 May 2009, Avi Kivity wrote: Davide Libenzi wrote: What's your take on adding irq context safe callbacks to irqfd? To give some background here, we would like to use eventfd as a generic connector between components, so the components do not know about each other. So far eventfd successfully abstracts among components in the same process, in different processes, and in the kernel. eventfd_signal() can be safely called from irq context, and will wake up a waiting task. But in some cases, if the consumer is in the kernel, it may be able to consume the event from irq context, saving a context switch. So, will you consider patches adding this capability to eventfd? Maybe I got lost in the thread, but inside the kernel we have callback-based wakeup since long time. This is what epoll uses, when hooking into the file* f_op-poll() subsystem. Did you mean something else? Do you mean wait_queue_t::func? Yes, it is since long time ago that a wakeup in Linux can lead either to a real wakeup (in scheduler terms), or to a simple function call. Isn't that enough? It's perfect. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: What do you think of my mmio hypercall? That will speed up all mmio to be as fast as a hypercall, and then we can use ordinary mmio/pio writes to trigger things. I like it! Bigger question is what kind of work goes into making mmio a pv_op (or is this already done)? Looks like it isn't there. But it isn't any different than set_pte - convert a write into a hypercall. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: Gregory Haskins wrote: What do you think of my mmio hypercall? That will speed up all mmio to be as fast as a hypercall, and then we can use ordinary mmio/pio writes to trigger things. I like it! Bigger question is what kind of work goes into making mmio a pv_op (or is this already done)? Looks like it isn't there. But it isn't any different than set_pte - convert a write into a hypercall. I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the slow path are an acceptable casualty, I think we can make that work. Thoughts? -Greg signature.asc Description: OpenPGP digital signature
Re: [PATCH 2/2] Intel-IOMMU, intr-remap: source-id checking
On Wed, 2009-05-06 at 23:16 -0700, Han, Weidong wrote: @@ -634,6 +694,44 @@ static int ir_parse_ioapic_scope(struct acpi_dmar_header *header, 0x%Lx\n, scope-enumeration_id, drhd-address); + bus = pci_find_bus(drhd-segment, scope-bus); + path = (struct acpi_dmar_pci_path *)(scope + 1); + count = (scope-length - + sizeof(struct acpi_dmar_device_scope)) + / sizeof(struct acpi_dmar_pci_path); + + while (count) { + if (pdev) + pci_dev_put(pdev); + + if (!bus) + break; + + pdev = pci_get_slot(bus, + PCI_DEVFN(path-dev, path-fn)); + if (!pdev) + break; ir_parse_ioapic_scope() happens very early in the boot. So, I don't think we can do the pci related discovery here. thanks, suresh + + path++; + count--; + bus = pdev-subordinate; + } + + if (pdev) { /* PCI discoverable IOAPIC*/ + ir_ioapic[ir_ioapic_num].bus = + pdev-bus-number; + ir_ioapic[ir_ioapic_num].devfn = pdev-devfn; + } else { /* Not PCI discoverable IOAPIC */ + if (!bus) + ir_ioapic[ir_ioapic_num].bus = + scope-bus; + else + ir_ioapic[ir_ioapic_num].bus = + bus-number; + ir_ioapic[ir_ioapic_num].devfn = + PCI_DEVFN(path-dev, path-fn); + } + ir_ioapic[ir_ioapic_num].iommu = iommu; ir_ioapic[ir_ioapic_num].id = scope-enumeration_id; ir_ioapic_num++; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the slow path are an acceptable casualty, I think we can make that work. That's my thinking exactly. Note we can cheat further. kvm already has a coalesced mmio feature where side-effect-free mmios are collected in the kernel and passed to userspace only when some other significant event happens. We could pass those addresses to the guest and let it queue those writes itself, avoiding the hypercall completely. Though it's probably pointless: if the guest is paravirtualized enough to have the mmio hypercall, then it shouldn't be using e1000. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: Gregory Haskins wrote: I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the slow path are an acceptable casualty, I think we can make that work. That's my thinking exactly. Cool, I will code this up and submit it. While Im at it, Ill run it through the nullio ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Note we can cheat further. kvm already has a coalesced mmio feature where side-effect-free mmios are collected in the kernel and passed to userspace only when some other significant event happens. We could pass those addresses to the guest and let it queue those writes itself, avoiding the hypercall completely. Though it's probably pointless: if the guest is paravirtualized enough to have the mmio hypercall, then it shouldn't be using e1000. Yeah...plus at least for my vbus purposes, all my my guest-host transitions are explicitly to cause side-effects, or I wouldn't be doing them in the first place ;) I suspect virtio-pci is exactly the same. I.e. the coalescing has already been done at a higher layer for platforms running PV code. Still a cool feature, tho. -Greg signature.asc Description: OpenPGP digital signature
qemu: kvm: avoid harmless unhandled wrmsr 0xc0010117 messages
Olders kernel which don't contain kvm.git commit 61a6bd672bda3b9468bf5895c1be085c4e481138 display the following message: kvm: 32301: cpu0 unhandled wrmsr: 0xc0010117 data 0 When kvm_arch_load_regs is called. This is confusing in bug reports. Avoid it by checking whether the host advertises the MSR, similarly to how MSR_STAR is handled. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 06ef775..5d19705 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -25,6 +25,7 @@ static struct kvm_msr_list *kvm_msr_list; extern unsigned int kvm_shadow_memory; static int kvm_has_msr_star; +static int kvm_has_vm_hsave_pa; static int lm_capable_kernel; @@ -54,10 +55,14 @@ int kvm_arch_qemu_create_context(void) kvm_msr_list = kvm_get_msr_list(kvm_context); if (!kvm_msr_list) return -1; -for (i = 0; i kvm_msr_list-nmsrs; ++i) +for (i = 0; i kvm_msr_list-nmsrs; ++i) { if (kvm_msr_list-indices[i] == MSR_STAR) kvm_has_msr_star = 1; - return 0; +if (kvm_msr_list-indices[i] == MSR_VM_HSAVE_PA) +kvm_has_vm_hsave_pa = 1; +} + +return 0; } static void set_msr_entry(struct kvm_msr_entry *entry, uint32_t index, @@ -260,7 +265,8 @@ void kvm_arch_load_regs(CPUState *env) set_msr_entry(msrs[n++], MSR_IA32_SYSENTER_EIP, env-sysenter_eip); if (kvm_has_msr_star) set_msr_entry(msrs[n++], MSR_STAR, env-star); -set_msr_entry(msrs[n++], MSR_VM_HSAVE_PA, env-vm_hsave); +if (kvm_has_vm_hsave_pa) +set_msr_entry(msrs[n++], MSR_VM_HSAVE_PA, env-vm_hsave); #ifdef TARGET_X86_64 if (lm_capable_kernel) { set_msr_entry(msrs[n++], MSR_CSTAR, env-cstar); @@ -435,7 +441,8 @@ void kvm_arch_save_regs(CPUState *env) if (kvm_has_msr_star) msrs[n++].index = MSR_STAR; msrs[n++].index = MSR_IA32_TSC; -msrs[n++].index = MSR_VM_HSAVE_PA; +if (kvm_has_vm_hsave_pa) +msrs[n++].index = MSR_VM_HSAVE_PA; #ifdef TARGET_X86_64 if (lm_capable_kernel) { msrs[n++].index = MSR_CSTAR; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 3/4] KVM: introduce kvm_arch_can_free_memslot, disallow slot deletion if cached cr3
On Thu, May 07, 2009 at 05:16:35PM +0300, Avi Kivity wrote: mtosa...@redhat.com wrote: Disallow the deletion of memory slots (and aliases, for x86 case), if a vcpu contains a cr3 that points to such slot/alias. That allows the guest to induce failures in the host. I don't understand what you mean. What is the problem with returning errors in the ioctl handlers? The guest can cause an overflow in qemu, overwrite the parameters to KVM_GET_MSR_INDEX_LIST in an attempt to read kernel data, and get -E2BIG. Or pick your combination. Better to triple-fault the guest instead. Sure can additionally triple fault it, but the kernel might attempt to access the non-existant slot which cr3 points to before TRIPLE_FAULT is processed. So you have to avoid that possibility in the first place, thats why the patch modifies the ioctls to fail. +int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) +{ +return 1; +} + In general, instead of stubs in every arch, have x86 say KVM_HAVE_ARCH_CAN_FREE_MEMSLOT and define the stub in generic code when that define is not present. Will fix that. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Avi Kivity wrote: Gregory Haskins wrote: I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the slow path are an acceptable casualty, I think we can make that work. That's my thinking exactly. Cool, I will code this up and submit it. While Im at it, Ill run it through the nullio ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Don't - it's broken. It will also catch device assignment mmio and hypercall them. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: Gregory Haskins wrote: Avi Kivity wrote: Gregory Haskins wrote: I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the slow path are an acceptable casualty, I think we can make that work. That's my thinking exactly. Cool, I will code this up and submit it. While Im at it, Ill run it through the nullio ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Don't - it's broken. It will also catch device assignment mmio and hypercall them. Ah. Crap. Would you be conducive if I continue along with the dynhc() approach then? -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
* Avi Kivity (a...@redhat.com) wrote: Gregory Haskins wrote: Cool, I will code this up and submit it. While Im at it, Ill run it through the nullio ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Don't - it's broken. It will also catch device assignment mmio and hypercall them. Not necessarily. It just needs to be creative w/ IO_COND -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: * Avi Kivity (a...@redhat.com) wrote: Gregory Haskins wrote: Cool, I will code this up and submit it. While Im at it, Ill run it through the nullio ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Don't - it's broken. It will also catch device assignment mmio and hypercall them. Not necessarily. It just needs to be creative w/ IO_COND Hi Chris, Could you elaborate? How would you know which pages to hypercall and which to let PF? -Greg signature.asc Description: OpenPGP digital signature
Re: bridges
Ross Boylan wrote: On Thu, 2009-05-07 at 11:13 -0600, Cam Macdonell wrote: Ross Boylan wrote: I'm trying to understand bridging with KVM, but am still puzzled. I think that the recommended bridging with TAP means that packets from the VM will end up going out the host card attached to the default gateway. But it looks to me as if their IP address is unchanged, which means replies will never reach me. Is that correct? Do I need to NAT the packets, or is something already doing that? Hi Ross, This is the place to start http://www.linux-kvm.org/page/Networking. I saw that; it gives some recipes but I wasn't sure what their effect was. You want a public bridge. I'm not sure what their and me mean in your email. In short, with bridging each VM has its own IP and that VM can be accessed directly from the network. their = the VM. me = my host machine. So if the VM's are running on their own subnet, VMs do not run on their own subnet with bridged networking. e.g., 10.0.2.* (I've been assuming the subnet with TAP is like the one with the User Mode Network stack in 3.7.3 of http://www.nongnu.org/qemu/qemu-doc.html) and my host machine is on another net, e.g., 10.0.8.* then I think the packet will go out with an IP of 10.0.2.2 (say). When some other machine tries to reply to 10.0.2.2, the packet gets lost because the outside network thinks 10.0.2.* is not for it. At least that's my concern. If the return IP address on the packet were 10.0.8.44 (supposing that's the IP of my host machine) then the packets could find their way back. Using bridged networking is very different from the user stack. The user stack is extremely limited and slow. My host machine is on an internal network with a 10.* IP. The example might be clearer if one supposed that the VM's were on a 192.168.* network. I am perhaps being influenced by the fact that I don't want to ask for more IP's, so I don't want to configure the VM's to use an IP on our 10.0.8 network. Then you probably want to use a NAT network. A NAT setup puts all the VMs on their own network within the host machine. iptables is necessary to forward the subnet packets out to the world and back. Here is some older documentation, but not much has changed. Look at the first entry under Advanced Networking. https://help.ubuntu.com/community/KVMFeisty Does the TAP networking setup a whole subnet like the user mode network stack (e.g., running a DHCP server), or is the idea that I would just give the VM an IP on my subnet (10.0.8.*) in this example? No, bridge networking using taps (one tap per VM) and effectively sits all the VMs on the same network your host is on. You would need to get IPs from sysadmin for each VM. If the latter is the case (I'm now suspecting it is) then I think the solution is clear. I just stick the VM's on a private (to my machine) subnet, like 192.168.*, and I do NAT on the packets as they go out. NAT is a very common solution. Use VDE (vde.sourceforget.net) to create a virtual switch on your host for the VMs. dnsmasq can serve dynamic IPs to the VMs on their own subnet that doesn't bother your sysadmin at all. Use iptables to forward and receive packets through your host's NIC. Cam -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (ghask...@novell.com) wrote: Chris Wright wrote: * Avi Kivity (a...@redhat.com) wrote: Gregory Haskins wrote: Cool, I will code this up and submit it. While Im at it, Ill run it through the nullio ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Don't - it's broken. It will also catch device assignment mmio and hypercall them. Not necessarily. It just needs to be creative w/ IO_COND Hi Chris, Could you elaborate? How would you know which pages to hypercall and which to let PF? Was just thinking of some ugly mangling of the addr (I'm not entirely sure what would work best). -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: * Gregory Haskins (ghask...@novell.com) wrote: Chris Wright wrote: * Avi Kivity (a...@redhat.com) wrote: Gregory Haskins wrote: Cool, I will code this up and submit it. While Im at it, Ill run it through the nullio ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Don't - it's broken. It will also catch device assignment mmio and hypercall them. Not necessarily. It just needs to be creative w/ IO_COND Hi Chris, Could you elaborate? How would you know which pages to hypercall and which to let PF? Was just thinking of some ugly mangling of the addr (I'm not entirely sure what would work best). I think we just past the too complicated threshold. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: * Gregory Haskins (ghask...@novell.com) wrote: Chris Wright wrote: * Avi Kivity (a...@redhat.com) wrote: Gregory Haskins wrote: Cool, I will code this up and submit it. While Im at it, Ill run it through the nullio ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Don't - it's broken. It will also catch device assignment mmio and hypercall them. Not necessarily. It just needs to be creative w/ IO_COND Hi Chris, Could you elaborate? How would you know which pages to hypercall and which to let PF? Was just thinking of some ugly mangling of the addr (I'm not entirely sure what would work best). Right, I get the part about flagging the address and then keying off that flag in IO_COND (like we do for PIO vs MMIO). What I am not clear on is how you would know to flag the address to begin with. Here's a thought: PV drivers can flag the IO (e.g. virtio-pci knows it would never be a real device). This means we route any io requests from virtio-pci though pv_io_ops-mmio(), but not unflagged addresses. This is not as slick as boosting *everyones* mmio speed as Avi's original idea would have, but it is perhaps a good tradeoff between the entirely new namespace created by my original dynhc() proposal and leaving them all PF based. This way, its just like using my dynhc() proposal except the mmio-addr is the substitute address-token (instead of the dynhc-vector). Additionally, if you do not PV the kernel the IO_COND/pv_io_op is ignored and it just slow-paths through the PF as it does today. Dynhc() would be dependent on pv_ops. Thoughts? -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Don't - it's broken. It will also catch device assignment mmio and hypercall them. Ah. Crap. Would you be conducive if I continue along with the dynhc() approach then? Oh yes. But don't call it dynhc - like Chris says it's the wrong semantic. Since we want to connect it to an eventfd, call it HC_NOTIFY or HC_EVENT or something along these lines. You won't be able to pass any data, but that's fine. Registers are saved to memory anyway. And btw, given that eventfd and the underlying infrastructure are so flexible, it's probably better to go back to your original irqfd gets fd from userspace just to be consistent everywhere. (no, I'm not deliberately making you rewrite that patch again and again... it's going to be a key piece of infrastructure so I want to get it right) Just to make sure we have everything plumbed down, here's how I see things working out (using qemu and virtio, use sed to taste): 1. qemu starts up, sets up the VM 2. qemu creates virtio-net-server 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx ring, one set for rx ring) 4. qemu connects the six eventfd to the data-available, data-not-available, and kick ports of virtio-net-server 5. the guest starts up and configures virtio-net in pci pin mode 6. qemu notices and decides it will manage interrupts in user space since this is complicated (shared level triggered interrupts) 7. the guest OS boots, loads device driver 8. device driver switches virtio-net to msix mode 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the notify fds as notifyfd 10. look ma, no hands. Under the hood, the following takes place. kvm wires the irqfds to schedule a work item which fires the interrupt. One day the kvm developers get their act together and change it to inject the interrupt directly when the irqfd is signalled (which could be from the net softirq or somewhere similarly nasty). virtio-net-server wires notifyfd according to its liking. It may schedule a thread, or it may execute directly. And they all lived happily ever after. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: I think we just past the too complicated threshold. And the can't spel threshold in the same sentence. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
On Thursday 07 May 2009, Gregory Haskins wrote: I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the slow path are an acceptable casualty, I think we can make that work. Thoughts? An mmio that goes through a PF is a bug, it's certainly broken on a number of platforms, so performance should not be an issue there. Note that are four commonly used interface classes for PIO/MMIO: 1. readl/writel: little-endian MMIO 2. inl/outl: little-endian PIO 3. ioread32/iowrite32: converged little-endian PIO/MMIO 4. __raw_readl/__raw_writel: native-endian MMIO without checks You don't need to worry about the __raw_* stuff, as this should never be used in device drivers. As a simplification, you could mandate that all drivers that want to use this get converted to the ioread/iowrite class of interfaces and leave the others slow. Arnd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html