date:20090507

[PATCH 2/2] Intel-IOMMU, intr-remap: source-id checking

2009-05-07 Thread Weidong Han

To support domain-isolation usages, the platform hardware must be
capable of uniquely identifying the requestor (source-id) for each
interrupt message. Without source-id checking for interrupt remapping
, a rouge guest/VM with assigned devices can launch interrupt attacks
to bring down anothe guest/VM or the VMM itself.

This patch adds source-id checking for interrupt remapping, and then
really isolates interrupts for guests/VMs with assigned devices.

Signed-off-by: Weidong Han weidong@intel.com
---
 arch/x86/kernel/apic/io_apic.c |6 +++
 drivers/pci/intr_remapping.c   |   98 
 drivers/pci/intr_remapping.h   |2 +
 include/linux/dmar.h   |   11 +
 4 files changed, 117 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 30da617..3d10c68 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1559,6 +1559,9 @@ int setup_ioapic_entry(int apic_id, int irq,
irte.vector = vector;
irte.dest_id = IRTE_DEST(destination);
 
+   /* Set source-id of interrupt request */
+   set_ioapic_sid(irte, apic_id);
+
modify_irte(irq, irte);
 
ir_entry-index2 = (index  15)  0x1;
@@ -3329,6 +3332,9 @@ static int msi_compose_msg(struct pci_dev *pdev, unsigned 
int irq, struct msi_ms
irte.vector = cfg-vector;
irte.dest_id = IRTE_DEST(dest);
 
+   /* Set source-id of interrupt request */
+   set_msi_sid(irte, pdev);
+
modify_irte(irq, irte);
 
msg-address_hi = MSI_ADDR_BASE_HI;
diff --git a/drivers/pci/intr_remapping.c b/drivers/pci/intr_remapping.c
index 946e170..825bca2 100644
--- a/drivers/pci/intr_remapping.c
+++ b/drivers/pci/intr_remapping.c
@@ -10,6 +10,7 @@
 #include linux/intel-iommu.h
 #include intr_remapping.h
 #include acpi/acpi.h
+#include pci.h
 
 static struct ioapic_scope ir_ioapic[MAX_IO_APICS];
 static int ir_ioapic_num;
@@ -405,6 +406,61 @@ int free_irte(int irq)
return rc;
 }
 
+int set_ioapic_sid(struct irte *irte, int apic)
+{
+   int i;
+   u16 sid = 0;
+
+   if (!irte)
+   return -1;
+
+   for (i = 0; i  MAX_IO_APICS; i++)
+   if (ir_ioapic[i].id == apic) {
+   sid = (ir_ioapic[i].bus  8) | ir_ioapic[i].devfn;
+   break;
+   }
+
+   if (sid == 0) {
+   printk(KERN_WARNING Failed to set source-id of 
+  I/O APIC (%d), because it is not under 
+  any DRHD\n, apic);
+   return -1;
+   }
+
+   irte-svt = 1; /* requestor ID verification SID/SQ */
+   irte-sq = 0;  /* comparing all 16-bit of SID */
+   irte-sid = sid;
+
+   return 0;
+}
+
+int set_msi_sid(struct irte *irte, struct pci_dev *dev)
+{
+   struct pci_dev *tmp;
+
+   if (!irte || !dev)
+   return -1;
+
+   tmp = pci_find_upstream_pcie_bridge(dev);
+   if (!tmp) { /* PCIE device or integrated PCI device */
+   irte-svt = 1; /* verify requestor ID verification SID/SQ */
+   irte-sq = 0;  /* comparing all 16-bit of SID */
+   irte-sid = (dev-bus-number  8) | dev-devfn;
+   return 0;
+   }
+
+   if (tmp-is_pcie) { /* this is a PCIE-to-PCI bridge */
+   irte-svt = 2; /* verify request ID verification SID */
+   irte-sid = (tmp-bus-number  8) | dev-bus-number;
+   } else { /* this is a legacy PCI bridge */
+   irte-svt = 1; /* verify requestor ID verification SID/SQ */
+   irte-sq = 0;  /* comparing all 16-bit of SID */
+   irte-sid = (tmp-bus-number  8) | tmp-devfn;
+   }
+
+   return 0;
+}
+
 static void iommu_set_intr_remapping(struct intel_iommu *iommu, int mode)
 {
u64 addr;
@@ -616,6 +672,10 @@ static int ir_parse_ioapic_scope(struct acpi_dmar_header 
*header,
struct acpi_dmar_hardware_unit *drhd;
struct acpi_dmar_device_scope *scope;
void *start, *end;
+   struct acpi_dmar_pci_path *path;
+   struct pci_bus *bus;
+   struct pci_dev *pdev = NULL;
+   int count;
 
drhd = (struct acpi_dmar_hardware_unit *)header;
 
@@ -634,6 +694,44 @@ static int ir_parse_ioapic_scope(struct acpi_dmar_header 
*header,
0x%Lx\n, scope-enumeration_id,
   drhd-address);
 
+   bus = pci_find_bus(drhd-segment, scope-bus);
+   path = (struct acpi_dmar_pci_path *)(scope + 1);
+   count = (scope-length -
+sizeof(struct acpi_dmar_device_scope))
+   / sizeof(struct acpi_dmar_pci_path);
+
+   while (count) {
+   if (pdev)
+

[PATCH 1/2] Intel-IOMMU, intr-remap: set the whole 128bits of irte when modify/free it

2009-05-07 Thread Weidong Han

Interrupt remapping table entry is 128bits. Currently, it only sets low
64bits of irte in modify_irte and free_irte. This ignores high 64bits
setting of irte, that means source-id setting will be ignored. This patch
sets the whole 128bits of irte when modify/free it. Following source-id
checking patch depends on this.

Signed-off-by: Weidong Han weidong@intel.com
---
 drivers/pci/intr_remapping.c |   10 +++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/intr_remapping.c b/drivers/pci/intr_remapping.c
index f5e0ea7..946e170 100644
--- a/drivers/pci/intr_remapping.c
+++ b/drivers/pci/intr_remapping.c
@@ -309,7 +309,8 @@ int modify_irte(int irq, struct irte *irte_modified)
index = irq_iommu-irte_index + irq_iommu-sub_handle;
irte = iommu-ir_table-base[index];
 
-   set_64bit((unsigned long *)irte, irte_modified-low);
+   set_64bit((unsigned long *)irte-low, irte_modified-low);
+   set_64bit((unsigned long *)irte-high, irte_modified-high);
__iommu_flush_cache(iommu, irte, sizeof(*irte));
 
rc = qi_flush_iec(iommu, index, 0);
@@ -386,8 +387,11 @@ int free_irte(int irq)
irte = iommu-ir_table-base[index];
 
if (!irq_iommu-sub_handle) {
-   for (i = 0; i  (1  irq_iommu-irte_mask); i++)
-   set_64bit((unsigned long *)(irte + i), 0);
+   for (i = 0; i  (1  irq_iommu-irte_mask); i++) {
+   set_64bit((unsigned long *)irte-low, 0);
+   set_64bit((unsigned long *)irte-high, 0);
+   irte++;
+   }
rc = qi_flush_iec(iommu, index, irq_iommu-irte_mask);
}
 
-- 
1.6.0.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: tg3 driver in guest fails for VT-d passthrough NIC

2009-05-07 Thread Han, Weidong

Yuji virtualized more capabilities to guest in Xen Qemu. I think your tg3 
driver can work in Xen guest with VT-d passthrough. If anyone interests, he/she 
can port the code from Xen Qemu to KVM Qemu.

Regards,
Weidong

Alex Williamson wrote:
 On Wed, May 6, 2009 at 11:31 AM, Nadolski, Ed ed.nadol...@lsi.com
 wrote: 
 Hi,
 
 I am running the KVM kernel  userspace downloaded on 4/24 with
 Fedora 10 host Linux on a Dell T7400 Xeon with VT-x and VT-d
 enabled. The T7400 has an onboard Broadcom NIC, but when I use VT-d
 to assign this NIC to a Fedora 10 guest, the NIC's tg3 driver in the
 guest aborts because it cannot find the PM Capability in the
 device's PCI config space. ... 
 So has anyone else seen this, and what is the right way to address
 it?  It's not good to simply pass thru certain device Capabilities
 if they cannot be properly handled by KVM/QEMU, but OTOH the
 unmodified guest OS drivers should not break because they were
 written to expect certain Capabilities.  Is there any plan to add
 support for these missing Capabilities? 
 
 Yes, I've seen this with bnx2 devices.  For a Linux guest, it's a
 fairly trivial change to the driver to not depend on these
 capabilities, but it would be far more compatible with existing driver
 code out there to make an attempt to support support them in KVM.
 AFAIK, it's not being working on yet.
 
 Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: Question about KVM and PC speaker

2009-05-07 Thread Jan Kiszka

Sebastian Herbszt wrote:
 Jan Kiszka wrote:
 Sebastian Herbszt wrote:
 Simon Bienlein wrote:
 Is a support for BIOS worked on right now?

 The vgabios (vgabios.c) has a FIXME should beep. Volker, do you plan
 to fix this?

 Which frequency should be used for the beep? Which delay?

 I would try 1 KHz and some hundred milliseconds.
 
 I just looked at some vga bios and it uses about 896,45 Hz.
 
 Getting a delay using inb(0x61)  0x10 is still a no go on qemu,
 right?

 Looks like (should be far too inaccurate for longer delays). What about
 0x40:0x6c, the BIOS' daily timer counter?
 
 The bios i looked at used the refresh request port 0x61. This is
 supported by bochs and
 there is also a patch for qemu to replace the dummy [1]. The rombios
 uses this to provide
 INT 15h AH=86h functionality; this is likely broken with the dummy code
 in qemu.

I see no problem with improving qemu's emulation accuracy this way a
bit. But I wouldn't built new delay implementations on top of it,
specifically if the code is in fact aware of running over a hypervisor.
Such micro-timings are far too inaccurate for longer delays in an
environment where you cannot be sure of running all the time during that
period.

 
 Anyway, using timer ticks since midnight should be possible (INT 08h
 handler is set up
 before vga bios is called).
 
 [1] http://article.gmane.org/gmane.comp.emulators.bochs.devel/7594
 
 - Sebastian
 

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jes Sorensen


Zhang, Xiantao wrote:

--- qemu-kvm.orig/vl.c
+++ qemu-kvm/vl.c
@@ -4417,13 +4417,11 @@
 }
 if (cpu_can_run(env))
 ret = qemu_cpu_exec(env);
-#ifndef CONFIG_GDBSTUB
^^^
Don't know why change #ifdef to #ifndef in upstream, and I remember it should be ifdef before. I believe this stuff should be compiled only if CONFIG_GDBSTUB is defined.  


Yes, I made get_set_top_cpu() a noop instead when compiled with
CONFIG_GDBSTUB disabled, that way we avoid a bunch of these silly
#ifdefs.

Cheers,
Jes
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Michael S. Tsirkin

pci_enable_msix currently returns -EINVAL if you ask
for more vectors than supported by the device, which would
typically cause fallback to regular interrupts.

It's better to return the table size, making the driver retry
MSI-X with less vectors.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Hi Jesse,
This came up when I was adding MSI-X support to virtio pci driver,
which does not know the exact table size upfront.
Could you consider this patch for 2.6.31 please?


 drivers/pci/msi.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 6f2e629..f5bd1c9 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -687,8 +687,8 @@ int pci_msix_table_size(struct pci_dev *dev)
  * indicates the successful configuration of MSI-X capability structure
  * with new allocated MSI-X irqs. A return of  0 indicates a failure.
  * Or a return of  0 indicates that driver request is exceeding the number
- * of irqs available. Driver should use the returned value to re-send
- * its request.
+ * of irqs or MSI-X vectors available. Driver should use the returned value to
+ * re-send its request.
  **/
 int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 {
@@ -704,7 +704,7 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry 
*entries, int nvec)
 
nr_entries = pci_msix_table_size(dev);
if (nvec  nr_entries)
-   return -EINVAL;
+   return nr_entries;
 
/* Check for any invalid entries */
for (i = 0; i  nvec; i++) {
-- 
1.6.3.rc3.1.g830204
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[ kvm-Bugs-2506814 ] TAP network lockup after some traffic

2009-05-07 Thread SourceForge.net

Bugs item #2506814, was opened at 2009-01-14 11:38
Message generated for change (Comment added) made by mellen
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2506814group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Fabio Coatti (cova)
Assigned to: Nobody/Anonymous (nobody)
Summary: TAP network lockup after some traffic

Initial Comment:
Hi all,
we are experiencing severe network troubles using kvm+tap networking.
basically, after some network load (we have yet to identify the exact amount of 
traffic if one exist) network stops working. 
During lockups, With tcpdump we can see arp requests on guest interface, then 
on tap, brX and physical interfaces on host system.
the arp answers can be seen, with tcpdump, only on physical host interface and 
bridge (brX), but not on tap device. Basically it seems that packets coming 
from external network get lost in tap device on the way to guest (kvm). Looking 
at tap data with ifconfig, the only weird thing is the TX packets overrun count 
that is  0. every time the network stops working, overrun count increases.
This has been observed with several kvm releases (for sure, 
76/77/78/79/80/81/82) and with different kernels (tried with some versions 
among 2.6.25.X, 26.X, 27.X, 28) both on guest and host side.
we tried several network drivers (virtio, e1000, rtl) and all shows the same 
problem. Only 100Mbit drivers seems to be unaffected so far. (only virtio has 
acceptable performance)
(btw: on host machine we have vlan on top of ethX devices)
cpu number on guest makes no difference.
we tried with vanilla kernel provided kvm modules and with kvm package provided 
modules.
guest: 32 bit
host: 64 bit
host machine:

2 x Quad-Core AMD Opteron(tm) Processor 2352 16GB, gentoo.

Of course I can provide more details or perform other tests and try patches, if 
someone can give me some hints and advices they will be most welcome.

Thanks.




--

Comment By: Tais M. Hansen (mellen)
Date: 2009-05-07 10:39

Message:
Just happened again. Seems like it stops generating interrupts on virtio
devices:

 10:   3001  51199   IO-APIC-fasteoi   virtio1, virtio2
 11:76948358468134   IO-APIC-fasteoi   virtio0

doing an /bin/ls froze the guest for a minute or so until a CTRL-C got
through. After /bin/ls:

 10:   3001  51220   IO-APIC-fasteoi   virtio1, virtio2
 11:76948358468134   IO-APIC-fasteoi   virtio0

kvm_stat:
efer_reload0 0
exits 19150288617  1082
fpu_reload 11976730134
halt_exits 544721221   194
halt_wakeup300187634   141
host_state_reload  837279034   259
hypercalls4991133618 0
insn_emulation4797940911   709
insn_emulation_fail0 0
invlpg 0 0
io_exits   28125446265
irq_exits  173205458 0
irq_window 0 0
largepages 0 0
mmio_exits720457 0
mmu_cache_miss 290275800 0
mmu_flooded289036380 0
mmu_pde_zapped 170308104 0
mmu_pte_updated   6239297430 0
mmu_pte_write 20470585377 0
mmu_recycled   0 0
mmu_shadow_zapped  335324686 0
pf_fixed  8324827214 0
pf_guest  2229555421 0
remote_tlb_flush   283772247 0
request_irq0 0
signal_exits   6 0
tlb_flush 533453055666

... but there are 6 other guests on this host running just fine.

Can't connect gdb to the kvm's gdbserver. It just says Remote 'g' packet
reply is too long: 

Issuing system_reset stalled for a minute, then rebooted the guest. After
reboot, guest is find again.

--

Comment By: Tais M. Hansen (mellen)
Date: 2009-05-06 17:07

Message:
I'm curious about the status of this issue?

I'm experiencing the same problem apparently randomly on guests.
Restarting the network interface does not seem to fix the problem. Only a
reboot (or system_reset in kvm/qemu console) solves it.

Last time it happened was on a host with kvm-84 and guest with kernel
2.6.27 using virtio-net. Leading up to the stall it had a traffic load of
about 5 mbit constantly up and down for just over 2 hours with one 12 mbit
spikes.

I did not check interface counters or network traces at the time.


--

Comment By: Fabio Coatti (cova)

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Sheng Yang

On Thursday 07 May 2009 16:28:41 Michael S. Tsirkin wrote:
 pci_enable_msix currently returns -EINVAL if you ask
 for more vectors than supported by the device, which would
 typically cause fallback to regular interrupts.

 It's better to return the table size, making the driver retry
 MSI-X with less vectors.

Hi Michael

I think driver should read from capability list to know how many vector 
supported by this device before enable MSI-X for device, as 
pci_msix_table_size() did...

-- 
regards
Yang, Sheng


 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---

 Hi Jesse,
 This came up when I was adding MSI-X support to virtio pci driver,
 which does not know the exact table size upfront.
 Could you consider this patch for 2.6.31 please?


  drivers/pci/msi.c |6 +++---
  1 files changed, 3 insertions(+), 3 deletions(-)

 diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
 index 6f2e629..f5bd1c9 100644
 --- a/drivers/pci/msi.c
 +++ b/drivers/pci/msi.c
 @@ -687,8 +687,8 @@ int pci_msix_table_size(struct pci_dev *dev)
   * indicates the successful configuration of MSI-X capability structure
   * with new allocated MSI-X irqs. A return of  0 indicates a failure.
   * Or a return of  0 indicates that driver request is exceeding the
 number - * of irqs available. Driver should use the returned value to
 re-send - * its request.
 + * of irqs or MSI-X vectors available. Driver should use the returned
 value to + * re-send its request.
   **/
  int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int
 nvec) {
 @@ -704,7 +704,7 @@ int pci_enable_msix(struct pci_dev* dev, struct
 msix_entry *entries, int nvec)

   nr_entries = pci_msix_table_size(dev);
   if (nvec  nr_entries)
 - return -EINVAL;
 + return nr_entries;

   /* Check for any invalid entries */
   for (i = 0; i  nvec; i++) {


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[ kvm-Bugs-2506814 ] TAP network lockup after some traffic

2009-05-07 Thread SourceForge.net

Bugs item #2506814, was opened at 2009-01-14 11:38
Message generated for change (Comment added) made by cova
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2506814group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Fabio Coatti (cova)
Assigned to: Nobody/Anonymous (nobody)
Summary: TAP network lockup after some traffic

Initial Comment:
Hi all,
we are experiencing severe network troubles using kvm+tap networking.
basically, after some network load (we have yet to identify the exact amount of 
traffic if one exist) network stops working. 
During lockups, With tcpdump we can see arp requests on guest interface, then 
on tap, brX and physical interfaces on host system.
the arp answers can be seen, with tcpdump, only on physical host interface and 
bridge (brX), but not on tap device. Basically it seems that packets coming 
from external network get lost in tap device on the way to guest (kvm). Looking 
at tap data with ifconfig, the only weird thing is the TX packets overrun count 
that is  0. every time the network stops working, overrun count increases.
This has been observed with several kvm releases (for sure, 
76/77/78/79/80/81/82) and with different kernels (tried with some versions 
among 2.6.25.X, 26.X, 27.X, 28) both on guest and host side.
we tried several network drivers (virtio, e1000, rtl) and all shows the same 
problem. Only 100Mbit drivers seems to be unaffected so far. (only virtio has 
acceptable performance)
(btw: on host machine we have vlan on top of ethX devices)
cpu number on guest makes no difference.
we tried with vanilla kernel provided kvm modules and with kvm package provided 
modules.
guest: 32 bit
host: 64 bit
host machine:

2 x Quad-Core AMD Opteron(tm) Processor 2352 16GB, gentoo.

Of course I can provide more details or perform other tests and try patches, if 
someone can give me some hints and advices they will be most welcome.

Thanks.




--

Comment By: Fabio Coatti (cova)
Date: 2009-05-07 10:51

Message:
We are still biten by this issue. I'm running out of ideas, but if someone
can give me some hints on how to track down the problem or at least collect
more information I'll try it.

Thanks.

--

Comment By: Tais M. Hansen (mellen)
Date: 2009-05-07 10:39

Message:
Just happened again. Seems like it stops generating interrupts on virtio
devices:

 10:   3001  51199   IO-APIC-fasteoi   virtio1, virtio2
 11:76948358468134   IO-APIC-fasteoi   virtio0

doing an /bin/ls froze the guest for a minute or so until a CTRL-C got
through. After /bin/ls:

 10:   3001  51220   IO-APIC-fasteoi   virtio1, virtio2
 11:76948358468134   IO-APIC-fasteoi   virtio0

kvm_stat:
efer_reload0 0
exits 19150288617  1082
fpu_reload 11976730134
halt_exits 544721221   194
halt_wakeup300187634   141
host_state_reload  837279034   259
hypercalls4991133618 0
insn_emulation4797940911   709
insn_emulation_fail0 0
invlpg 0 0
io_exits   28125446265
irq_exits  173205458 0
irq_window 0 0
largepages 0 0
mmio_exits720457 0
mmu_cache_miss 290275800 0
mmu_flooded289036380 0
mmu_pde_zapped 170308104 0
mmu_pte_updated   6239297430 0
mmu_pte_write 20470585377 0
mmu_recycled   0 0
mmu_shadow_zapped  335324686 0
pf_fixed  8324827214 0
pf_guest  2229555421 0
remote_tlb_flush   283772247 0
request_irq0 0
signal_exits   6 0
tlb_flush 533453055666

... but there are 6 other guests on this host running just fine.

Can't connect gdb to the kvm's gdbserver. It just says Remote 'g' packet
reply is too long: 

Issuing system_reset stalled for a minute, then rebooted the guest. After
reboot, guest is find again.

--

Comment By: Tais M. Hansen (mellen)
Date: 2009-05-06 17:07

Message:
I'm curious about the status of this issue?

I'm experiencing the same problem apparently randomly on guests.
Restarting the network interface does not seem to fix the problem. Only a
reboot (or system_reset in kvm/qemu console) solves it.

Last time it happened was on a host with kvm-84 and guest with kernel

[patch] fix - do not build blobs when blobs are not needed

2009-05-07 Thread Jes Sorensen


Hi,

Don't like ugly assembler errors because QEMU tries to build some x86
assembly code which isn't needed on non-x86.

Jes
Do not try to build extboot when INSTALL_BLOBS is not set.

Signed-off-by: Jes Sorensen j...@sgi.com

---
 Makefile  |2 ++
 configure |1 +
 2 files changed, 3 insertions(+)

Index: qemu-kvm/Makefile
===
--- qemu-kvm.orig/Makefile
+++ qemu-kvm/Makefile
@@ -419,6 +419,7 @@
 
 .PHONY: kvm/extboot
 
+ifdef INSTALL_BLOBS
 all: kvm/extboot
 
 kvm/extboot:
@@ -427,3 +428,4 @@
|| ! cmp -s pc-bios/extboot.bin $@/extboot.bin; then \
 		cp $@/extboot.bin pc-bios/extboot.bin; \
 	fi
+endif
Index: qemu-kvm/configure
===
--- qemu-kvm.orig/configure
+++ qemu-kvm/configure
@@ -356,6 +356,7 @@
  cpu_emulation=no
  gdbstub=no
  slirp=no
+ blobs=no
 fi
 if [ $cpu = powerpc ]; then
  kvm=yes

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Sheng Yang

On Thursday 07 May 2009 17:05:06 Michael S. Tsirkin wrote:
 On Thu, May 07, 2009 at 04:51:24PM +0800, Sheng Yang wrote:
  On Thursday 07 May 2009 16:28:41 Michael S. Tsirkin wrote:
   pci_enable_msix currently returns -EINVAL if you ask
   for more vectors than supported by the device, which would
   typically cause fallback to regular interrupts.
  
   It's better to return the table size, making the driver retry
   MSI-X with less vectors.
 
  Hi Michael
 
  I think driver should read from capability list to know how many vector
  supported by this device before enable MSI-X for device, as
  pci_msix_table_size() did...

 Drivers can do this, but it's more code. Since pci_enable_msix
 calls pci_msix_table_size already, let it do the work. Right?

If you don't know the vectors number before you enable MSI-X, how can you 
setup vectors? 

I don't think it's proper way to go.

-- 
regards
Yang, Sheng


  --
  regards
  Yang, Sheng
 
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
   ---
  
   Hi Jesse,
   This came up when I was adding MSI-X support to virtio pci driver,
   which does not know the exact table size upfront.
   Could you consider this patch for 2.6.31 please?
  
  
drivers/pci/msi.c |6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
  
   diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
   index 6f2e629..f5bd1c9 100644
   --- a/drivers/pci/msi.c
   +++ b/drivers/pci/msi.c
   @@ -687,8 +687,8 @@ int pci_msix_table_size(struct pci_dev *dev)
 * indicates the successful configuration of MSI-X capability
   structure * with new allocated MSI-X irqs. A return of  0 indicates a
   failure. * Or a return of  0 indicates that driver request is
   exceeding the number - * of irqs available. Driver should use the
   returned value to re-send - * its request.
   + * of irqs or MSI-X vectors available. Driver should use the returned
   value to + * re-send its request.
 **/
int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries,
   int nvec) {
   @@ -704,7 +704,7 @@ int pci_enable_msix(struct pci_dev* dev, struct
   msix_entry *entries, int nvec)
  
 nr_entries = pci_msix_table_size(dev);
 if (nvec  nr_entries)
   - return -EINVAL;
   + return nr_entries;
  
 /* Check for any invalid entries */
 for (i = 0; i  nvec; i++) {


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix - do not build blobs when blobs are not needed

2009-05-07 Thread Avi Kivity


Jes Sorensen wrote:

Hi,

Don't like ugly assembler errors because QEMU tries to build some x86
assembly code which isn't needed on non-x86.


.PHONY: kvm/extboot

+ifdef INSTALL_BLOBS
all: kvm/extboot

kvm/extboot:
@@ -427,3 +428,4 @@
   || ! cmp -s pc-bios/extboot.bin $@/extboot.bin; then \
cp $@/extboot.bin pc-bios/extboot.bin; \
fi
+endif
Index: qemu-kvm/configure
===
--- qemu-kvm.orig/configure
+++ qemu-kvm/configure
@@ -356,6 +356,7 @@
 cpu_emulation=no
 gdbstub=no
 slirp=no
+ blobs=no
fi
if [ $cpu = powerpc ]; then
 kvm=yes


Seems a bit heavy handed.  blobs != extboot.

How about instead:

   build-targets-x86 = kvm/extboot

   all: $(build-targets-$(ARCH))

(or something similar that builds)

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Matthew Wilcox

On Thu, May 07, 2009 at 04:51:24PM +0800, Sheng Yang wrote:
 On Thursday 07 May 2009 16:28:41 Michael S. Tsirkin wrote:
  pci_enable_msix currently returns -EINVAL if you ask
  for more vectors than supported by the device, which would
  typically cause fallback to regular interrupts.
 
  It's better to return the table size, making the driver retry
  MSI-X with less vectors.
 
 Hi Michael
 
 I think driver should read from capability list to know how many vector 
 supported by this device before enable MSI-X for device, as 
 pci_msix_table_size() did...

I think Michael's patch makes sense.  It reduces the amount of work the
driver has to do without requiring any additional work in the core.  I
don't see the disadvantage to it.

Reviewed-by: Matthew Wilcox wi...@linux.intel.com

-- 
Matthew Wilcox  Intel Open Source Technology Centre
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Avi Kivity


Zhang, Xiantao wrote:

Jes Sorensen wrote:
  

Hi,

The latest changes to qemu-kvm breaks miserably if one tries to build
without CONFIG_GDBSTUB.

Jes



--- qemu-kvm.orig/vl.c
+++ qemu-kvm/vl.c
@@ -4417,13 +4417,11 @@
 }
 if (cpu_can_run(env))
 ret = qemu_cpu_exec(env);
-#ifndef CONFIG_GDBSTUB
^^^
Don't know why change #ifdef to #ifndef in upstream, and I remember it should be ifdef before. I believe this stuff should be compiled only if CONFIG_GDBSTUB is defined.  
  


This was introduced by

commit 704aec581c1683750e313832ba3aa4813d59cbd0
Author: Xiantao Zhang xiantao.zh...@intel.com
Date:   Thu Nov 27 17:23:27 2008 +0800

   Build fix for !CONFIG_GDBSTUB case
  
   Once CONFIG_GDBSTUB not configured, compile will generate error


In upstream.  Please fix it in upstream and qemu-kvm.git will get the 
fix from there.



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Michael S. Tsirkin

On Thu, May 07, 2009 at 05:10:46PM +0800, Sheng Yang wrote:
   I think driver should read from capability list to know how many vector
   supported by this device before enable MSI-X for device, as
   pci_msix_table_size() did...
 
  Drivers can do this, but it's more code. Since pci_enable_msix
  calls pci_msix_table_size already, let it do the work. Right?
 
 If you don't know the vectors number before you enable MSI-X, how can you 
 setup vectors? 

I don't know how many irqs I will be able to get anyway.
vectors that can't be assigned are useless ...

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix irq routing by moving it to the correct place

2009-05-07 Thread Avi Kivity


Jes Sorensen wrote:

Hi,

Some freak :-) put x86 specific irq routing into the generic code path,
which obviously won't work on non-x86 systems.

This patch creates kvm_arch_init_irq_routing() and moves the x86
gibberish to it's correct location :-)



Applied, thanks.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix - do not build blobs when blobs are not needed

2009-05-07 Thread Jes Sorensen


Avi Kivity wrote:

Seems a bit heavy handed.  blobs != extboot.

How about instead:

   build-targets-x86 = kvm/extboot

   all: $(build-targets-$(ARCH))

(or something similar that builds)


There's no problem that cannot be fixed by applying an appropriately
large hammer  :-)

This one works for me.

Jes

Do not try to build extboot on non-x86.

Signed-off-by: Jes Sorensen j...@sgi.com

---
 Makefile |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: qemu-kvm/Makefile
===
--- qemu-kvm.orig/Makefile
+++ qemu-kvm/Makefile
@@ -419,7 +419,10 @@
 
 .PHONY: kvm/extboot
 
-all: kvm/extboot
+build-targets-x86  = kvm/extboot
+build-targets-ia64 =
+
+all: $(build-targets-$(ARCH))
 
 kvm/extboot:
 	$(MAKE) -C $@

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Sheng Yang

On Thursday 07 May 2009 17:27:31 Matthew Wilcox wrote:
 On Thu, May 07, 2009 at 04:51:24PM +0800, Sheng Yang wrote:
  On Thursday 07 May 2009 16:28:41 Michael S. Tsirkin wrote:
   pci_enable_msix currently returns -EINVAL if you ask
   for more vectors than supported by the device, which would
   typically cause fallback to regular interrupts.
  
   It's better to return the table size, making the driver retry
   MSI-X with less vectors.
 
  Hi Michael
 
  I think driver should read from capability list to know how many vector
  supported by this device before enable MSI-X for device, as
  pci_msix_table_size() did...

 I think Michael's patch makes sense.  It reduces the amount of work the
 driver has to do without requiring any additional work in the core.  I
 don't see the disadvantage to it.

 Reviewed-by: Matthew Wilcox wi...@linux.intel.com

It's indeed weird. Why the semantic of pci_enable_msix can be changed to 
enable msix, or tell me how many vector do you have? You can simply call 
pci_msix_table_size() to get what you want, also without any more work, no? I 
can't understand...

-- 
regards
Yang, Sheng

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Avi Kivity


Davide Libenzi wrote:

On Wed, 6 May 2009, Gregory Haskins wrote:

  
I think we are ok in this regard (at least in v5) without the callback. 
kvm holds irqfd, which holds eventfd.  In a normal situation, we will

have eventfd with 2 references.  If userspace closes the eventfd, it
will drop 1 of the 2 eventfd file references, but the object should
remain intact as long as kvm still holds it as well.  When the kvm-fd is
released, we will then decouple from the eventfd-wqh and drop the last
fput(), officially freeing it.

Likewise, if kvm is closed before the eventfd, we will simply decouple
from the wqh and fput(eventfd), leaving the last reference held by
userspace until it closes as well.

Let me know if you see any holes in that.



Looks OK, modulo my knowledge of KVM internals.
  


What's your take on adding irq context safe callbacks to irqfd?

To give some background here, we would like to use eventfd as a generic 
connector between components, so the components do not know about each 
other.  So far eventfd successfully abstracts among components in the 
same process, in different processes, and in the kernel.


eventfd_signal() can be safely called from irq context, and will wake up 
a waiting task.  But in some cases, if the consumer is in the kernel, it 
may be able to consume the event from irq context, saving a context switch.


So, will you consider patches adding this capability to eventfd?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix - do not build blobs when blobs are not needed

2009-05-07 Thread Avi Kivity


Jes Sorensen wrote:

There's no problem that cannot be fixed by applying an appropriately
large hammer  :-)

This one works for me.


applied, thanks.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Zhang, Xiantao

Avi Kivity wrote:
 Zhang, Xiantao wrote:
 Jes Sorensen wrote:
 
 Hi,
 
 The latest changes to qemu-kvm breaks miserably if one tries to
 build without CONFIG_GDBSTUB. 
 
 Jes
 
 
 --- qemu-kvm.orig/vl.c
 +++ qemu-kvm/vl.c
 @@ -4417,13 +4417,11 @@
  }
  if (cpu_can_run(env))
  ret = qemu_cpu_exec(env);
 -#ifndef CONFIG_GDBSTUB
 ^^^
 Don't know why change #ifdef to #ifndef in upstream, and I remember
 it should be ifdef before. I believe this stuff should be compiled
 only if CONFIG_GDBSTUB is defined.  
 
 
 This was introduced by
 
 commit 704aec581c1683750e313832ba3aa4813d59cbd0
 Author: Xiantao Zhang xiantao.zh...@intel.com
 Date:   Thu Nov 27 17:23:27 2008 +0800
 
 Build fix for !CONFIG_GDBSTUB case
 
 Once CONFIG_GDBSTUB not configured, compile will generate error
 
 In upstream.  Please fix it in upstream and qemu-kvm.git will get the
 fix from there.

The original patch is 8c4379cc made by me in kvm-userspace.git and I used 
#ifdef, but accidently changed to #ifndef in qemu-kvm.git. So maybe this typo 
is introduced by Anthony Liguori when rewrote this patch for qemu-kvm.git ?  
Xiantao


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Matthew Wilcox

On Thu, May 07, 2009 at 05:40:15PM +0800, Sheng Yang wrote:
 It's indeed weird. Why the semantic of pci_enable_msix can be changed to 
 enable msix, or tell me how many vector do you have? You can simply call 
 pci_msix_table_size() to get what you want, also without any more work, no? I 
 can't understand...

Here's a good example.  Let's suppose you have a driver which supports
two different models of cards, one has 16 MSI-X interrupts, the other
has 10.  You can call pci_enable_msix() asking for 16 vectors.  If your
card is model A, you get 16 interrupts.  If your card is model B, it says
you can have 10.

This is less work in the driver (since it must implement falling back to
a smaller number of interrupts *anyway*) than interrogating the card to
find out how many interrupts there are, then requesting the right number,
and still having the fallback path which is going to be less tested.

-- 
Matthew Wilcox  Intel Open Source Technology Centre
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Avi Kivity


Zhang, Xiantao wrote:

Avi Kivity wrote:
  

Zhang, Xiantao wrote:


Jes Sorensen wrote:

  

Hi,

The latest changes to qemu-kvm breaks miserably if one tries to
build without CONFIG_GDBSTUB. 


Jes



--- qemu-kvm.orig/vl.c
+++ qemu-kvm/vl.c
@@ -4417,13 +4417,11 @@
 }
 if (cpu_can_run(env))
 ret = qemu_cpu_exec(env);
-#ifndef CONFIG_GDBSTUB
^^^
Don't know why change #ifdef to #ifndef in upstream, and I remember
it should be ifdef before. I believe this stuff should be compiled
only if CONFIG_GDBSTUB is defined.  

  

This was introduced by

commit 704aec581c1683750e313832ba3aa4813d59cbd0
Author: Xiantao Zhang xiantao.zh...@intel.com
Date:   Thu Nov 27 17:23:27 2008 +0800

Build fix for !CONFIG_GDBSTUB case

Once CONFIG_GDBSTUB not configured, compile will generate error

In upstream.  Please fix it in upstream and qemu-kvm.git will get the
fix from there.



The original patch is 8c4379cc made by me in kvm-userspace.git and I used #ifdef, but accidently changed to #ifndef in qemu-kvm.git. So maybe this typo is introduced by Anthony Liguori when rewrote this patch for qemu-kvm.git ?  
Xiantao
  


That commit got translated to fe538e6f in qemu-kvm.git.

Anthony, what's the reason for the difference between fe538e6f and 704aec58?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] qemu-kvm define cpu_has_work() for ia64

2009-05-07 Thread Avi Kivity


Jes Sorensen wrote:

Hi,

This one is needed to build on ia64.



Applied, thanks.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virtio-net: Fix save/load

2009-05-07 Thread Avi Kivity


Alex Williamson wrote:

Thanks Avi, I think you got the important bits.  I might recommend the
following change to keep this contained to a version_id 7 load and bail
if vnet header support is required, but not available.  Thanks,

Alex


kvm: virtio-net: Cleanup load from vnet header saved image

Bail if the saved image requires vnet header support.
  


Thanks, applied.  I even thought of this while applying the original 
patch but I guess my attention span is in a downward spiral so I forgot 
about it.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Sheng Yang

On Thursday 07 May 2009 17:53:02 Matthew Wilcox wrote:
 On Thu, May 07, 2009 at 05:40:15PM +0800, Sheng Yang wrote:
  It's indeed weird. Why the semantic of pci_enable_msix can be changed to
  enable msix, or tell me how many vector do you have? You can simply
  call pci_msix_table_size() to get what you want, also without any more
  work, no? I can't understand...

 Here's a good example.  Let's suppose you have a driver which supports
 two different models of cards, one has 16 MSI-X interrupts, the other
 has 10.  You can call pci_enable_msix() asking for 16 vectors.  If your
 card is model A, you get 16 interrupts.  If your card is model B, it says
 you can have 10.

 This is less work in the driver (since it must implement falling back to
 a smaller number of interrupts *anyway*) than interrogating the card to
 find out how many interrupts there are, then requesting the right number,
 and still having the fallback path which is going to be less tested.

Yeah, partly understand now.

But the confusing of return value is not that pleasure compared to this 
benefit. And even you have to fall back if return  0 anyway, but in the past, 
you just need fall back once at most; but now you may fall back twice. This 
make thing more complex - you need either two ifs or a simple loop. And just 
one if can deal with it before. All that required is one call for 
pci_msix_table_size(), and I believe most driver would like to know how much 
vector it have before it fill the vectors, so mostly no extra cost. But for 
this ambiguous return meaning, you have to add more code for fall back - yes, 
the driver may can assert that the positive return value always would be irq 
numbers if it call pci_msix_table_size() before, but is it safe in logic?

-- 
regards
Yang, Sheng
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/6] kvm-s390: use hrtimer for clock wakeup from idle

2009-05-07 Thread Avi Kivity


Christian Borntraeger wrote:

Why can't this be done from the timer context (after adjusting the locks)?



It can be done, in fact I did that in my first version. The thing is, we would 
need to change all local_int.locks to spinlock_irqs, since standard timers are 
softirqs and hrtimers and hardirqs. Enabling and disabling irqs is a relatively 
expensive operating on s390 (but locks via compare and swap are quite cheap). 
Since we take this specific lock in lots of places this lock is on some hot 
pathes. The idle wakeup on the other hand is not that critical.
  


Makes sense.

Back when irqsave/irqrestore were expensive on x86 (P4 days, I think it 
was ~100 cycles) there was talk of using a software irq disable flag 
instead of using the hardware support.  So


- local_irq_disable() sets a flag

- interrupt handlers check the flag, if set they schedule the interrupt 
handler and return immediately


- local_irq_restore() clears the flag and fires and scheduled handlers

Now these operations are pretty cheap on x86, but maybe that can apply 
to s390.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Michael Ellerman

On Thu, 2009-05-07 at 03:53 -0600, Matthew Wilcox wrote:
 On Thu, May 07, 2009 at 05:40:15PM +0800, Sheng Yang wrote:
  It's indeed weird. Why the semantic of pci_enable_msix can be changed to 
  enable msix, or tell me how many vector do you have? You can simply call 
  pci_msix_table_size() to get what you want, also without any more work, no? 
  I 
  can't understand...
 
 Here's a good example.  Let's suppose you have a driver which supports
 two different models of cards, one has 16 MSI-X interrupts, the other
 has 10.  You can call pci_enable_msix() asking for 16 vectors.  If your
 card is model A, you get 16 interrupts.  If your card is model B, it says
 you can have 10.
 
 This is less work in the driver (since it must implement falling back to
 a smaller number of interrupts *anyway*) than interrogating the card to
 find out how many interrupts there are, then requesting the right number,
 and still having the fallback path which is going to be less tested.

Not to mention that there's no guarantee that you'll get as many
interrupts as the device supports, so you should really be coding to
cope with that anyway. Like the example in MSI-HOWTO.txt:

197 static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
198 {
199 while (nvec = FOO_DRIVER_MINIMUM_NVEC) {
200 rc = pci_enable_msix(adapter-pdev,
201  adapter-msix_entries, nvec);
202 if (rc  0)
203 nvec = rc;
204 else
205 return rc;
206 }
207 
208 return -ENOSPC;
209 }

So I agree, this patch is an improvement.

cheers


signature.asc
Description: This is a digitally signed message part

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Sheng Yang

On Thursday 07 May 2009 18:23:50 Michael Ellerman wrote:
 On Thu, 2009-05-07 at 03:53 -0600, Matthew Wilcox wrote:
  On Thu, May 07, 2009 at 05:40:15PM +0800, Sheng Yang wrote:
   It's indeed weird. Why the semantic of pci_enable_msix can be changed
   to enable msix, or tell me how many vector do you have? You can
   simply call pci_msix_table_size() to get what you want, also without
   any more work, no? I can't understand...
 
  Here's a good example.  Let's suppose you have a driver which supports
  two different models of cards, one has 16 MSI-X interrupts, the other
  has 10.  You can call pci_enable_msix() asking for 16 vectors.  If your
  card is model A, you get 16 interrupts.  If your card is model B, it says
  you can have 10.
 
  This is less work in the driver (since it must implement falling back to
  a smaller number of interrupts *anyway*) than interrogating the card to
  find out how many interrupts there are, then requesting the right number,
  and still having the fallback path which is going to be less tested.

 Not to mention that there's no guarantee that you'll get as many
 interrupts as the device supports, so you should really be coding to
 cope with that anyway. Like the example in MSI-HOWTO.txt:

 197 static int foo_driver_enable_msix(struct foo_adapter *adapter, int
 nvec) 198 {
 199 while (nvec = FOO_DRIVER_MINIMUM_NVEC) {
 200 rc = pci_enable_msix(adapter-pdev,
 201  adapter-msix_entries, nvec);
 202 if (rc  0)
 203 nvec = rc;
 204 else
 205 return rc;
 206 }
 207
 208 return -ENOSPC;
 209 }

 So I agree, this patch is an improvement.

Oh yeah.

Forgot irq counts can also be changed from time to time.

OK, there should be a loop, so that's fine. :)

-- 
regards
Yang, Sheng
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/6] kvm-s390: use hrtimer for clock wakeup from idle

2009-05-07 Thread Christian Borntraeger

Am Thursday 07 May 2009 12:19:32 schrieb Avi Kivity:
 Makes sense.
 
 Back when irqsave/irqrestore were expensive on x86 (P4 days, I think it 
 was ~100 cycles) there was talk of using a software irq disable flag 
 instead of using the hardware support.  So
 
 - local_irq_disable() sets a flag
 
 - interrupt handlers check the flag, if set they schedule the interrupt 
 handler and return immediately
 
 - local_irq_restore() clears the flag and fires and scheduled handlers
 
 Now these operations are pretty cheap on x86, but maybe that can apply 
 to s390.

Interesting idea. Nevertheless, I dont think it improve our situation. The 
affected instructions (stosm and stnsm) are more expensive than compare and 
swap, but its nowhere near the ~100 cycles of P4.

Christian
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Michael S. Tsirkin

On Thu, May 07, 2009 at 06:19:53PM +0800, Sheng Yang wrote:
 On Thursday 07 May 2009 17:53:02 Matthew Wilcox wrote:
  On Thu, May 07, 2009 at 05:40:15PM +0800, Sheng Yang wrote:
   It's indeed weird. Why the semantic of pci_enable_msix can be changed to
   enable msix, or tell me how many vector do you have? You can simply
   call pci_msix_table_size() to get what you want, also without any more
   work, no? I can't understand...
 
  Here's a good example.  Let's suppose you have a driver which supports
  two different models of cards, one has 16 MSI-X interrupts, the other
  has 10.  You can call pci_enable_msix() asking for 16 vectors.  If your
  card is model A, you get 16 interrupts.  If your card is model B, it says
  you can have 10.
 
  This is less work in the driver (since it must implement falling back to
  a smaller number of interrupts *anyway*) than interrogating the card to
  find out how many interrupts there are, then requesting the right number,
  and still having the fallback path which is going to be less tested.
 
 Yeah, partly understand now.
 
 But the confusing of return value is not that pleasure compared to this 
 benefit. And even you have to fall back if return  0 anyway, but in the 
 past, 
 you just need fall back once at most; but now you may fall back twice.

I don't think that's right - you might not be able to get the
number of interrupts that pci_enable_msix reported.

 This 
 make thing more complex - you need either two ifs or a simple loop. And just 
 one if can deal with it before. All that required is one call for 
 pci_msix_table_size(), and I believe most driver would like to know how much 
 vector it have before it fill the vectors, so mostly no extra cost. But for 
 this ambiguous return meaning, you have to add more code for fall back - yes, 
 the driver may can assert that the positive return value always would be irq 
 numbers if it call pci_msix_table_size() before, but is it safe in logic?

If you know how many vectors the card has, then the only failure mode
is when you are out of irqs. No change there.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Avi Kivity


Michael Ellerman wrote:

Not to mention that there's no guarantee that you'll get as many
interrupts as the device supports, so you should really be coding to
cope with that anyway. Like the example in MSI-HOWTO.txt:

197 static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
198 {
199 while (nvec = FOO_DRIVER_MINIMUM_NVEC) {
200 rc = pci_enable_msix(adapter-pdev,
201  adapter-msix_entries, nvec);
202 if (rc  0)
203 nvec = rc;
204 else
205 return rc;
206 }
207 
208 return -ENOSPC;

209 }

So I agree, this patch is an improvement.
  


I imagine this loop is present in many drivers.  So why not add a helper

static int pci_enable_msix_min(struct foo_adapter *adapter, int min_nvec)
{
int nvec = 2048;

   while (nvec = min_nvec) {
   rc = pci_enable_msix(adapter-pdev,
adapter-msix_entries, nvec);
   if (rc == 0)
   return nvec;
   else if (rc  0)
   nvec = rc;
   else
   return rc;
   }
   return -ENOSPC;
}



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jan Kiszka

Avi Kivity wrote:
 Zhang, Xiantao wrote:
 Jes Sorensen wrote:
  
 Hi,

 The latest changes to qemu-kvm breaks miserably if one tries to build
 without CONFIG_GDBSTUB.

 Jes
 

 --- qemu-kvm.orig/vl.c
 +++ qemu-kvm/vl.c
 @@ -4417,13 +4417,11 @@
  }
  if (cpu_can_run(env))
  ret = qemu_cpu_exec(env);
 -#ifndef CONFIG_GDBSTUB
 ^^^
 Don't know why change #ifdef to #ifndef in upstream, and I remember it
 should be ifdef before. I believe this stuff should be compiled only
 if CONFIG_GDBSTUB is defined.
 
 This was introduced by
 
 commit 704aec581c1683750e313832ba3aa4813d59cbd0
 Author: Xiantao Zhang xiantao.zh...@intel.com
 Date:   Thu Nov 27 17:23:27 2008 +0800
 
Build fix for !CONFIG_GDBSTUB case
  Once CONFIG_GDBSTUB not configured, compile will generate error
 
 In upstream.  Please fix it in upstream and qemu-kvm.git will get the
 fix from there.

Given that CONFIG_GDBSTUB is always true upstream, I tend to say:
Finally drop it upstream and keep this (or a more arch-local) workaround
for missing ia64 gdbstub support downstream.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Avi Kivity


Jan Kiszka wrote:

In upstream.  Please fix it in upstream and qemu-kvm.git will get the
fix from there.



Given that CONFIG_GDBSTUB is always true upstream, I tend to say:
Finally drop it upstream and keep this (or a more arch-local) workaround
for missing ia64 gdbstub support downstream.
  


I agree, unless

- we want to make gdbstub support configurable (don't see any 
overwhelming reason for this, but maybe others do)
- we want to merge ia64 kvm support upstream, and don't want to impose 
gdbstub support (though I'd recommend properly implementing gdbstub)


In any case, I'm okay with dropping the check upstream and applying the 
local fixup.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jan Kiszka

Avi Kivity wrote:
 Jan Kiszka wrote:
 In upstream.  Please fix it in upstream and qemu-kvm.git will get the
 fix from there.
 

 Given that CONFIG_GDBSTUB is always true upstream, I tend to say:
 Finally drop it upstream and keep this (or a more arch-local) workaround
 for missing ia64 gdbstub support downstream.
   
 
 I agree, unless
 
 - we want to make gdbstub support configurable (don't see any
 overwhelming reason for this, but maybe others do)
 - we want to merge ia64 kvm support upstream, and don't want to impose
 gdbstub support (though I'd recommend properly implementing gdbstub)
 
 In any case, I'm okay with dropping the check upstream and applying the
 local fixup.

The last ia64 tcg patch I saw one or two days ago was missing gdb
support, too (and may suffer from some brokenness for that reason). But
I would suggest for both cases (tcg and kvm) to temporarily provide
stubs/placeholders instead of keeping the central #ifdefs.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Avi Kivity


Jan Kiszka wrote:

I agree, unless

- we want to make gdbstub support configurable (don't see any
overwhelming reason for this, but maybe others do)
- we want to merge ia64 kvm support upstream, and don't want to impose
gdbstub support (though I'd recommend properly implementing gdbstub)

In any case, I'm okay with dropping the check upstream and applying the
local fixup.



The last ia64 tcg patch I saw one or two days ago was missing gdb
support, too (and may suffer from some brokenness for that reason).


I didn't see ia64/tcg.  Did you mean s390?  That was host only IIRC.


 But
I would suggest for both cases (tcg and kvm) to temporarily provide
stubs/placeholders instead of keeping the central #ifdefs.
  


Agreed.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[ kvm-Bugs-2506814 ] TAP network lockup after some traffic

2009-05-07 Thread SourceForge.net

Bugs item #2506814, was opened at 2009-01-14 11:38
Message generated for change (Comment added) made by mellen
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2506814group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Fabio Coatti (cova)
Assigned to: Nobody/Anonymous (nobody)
Summary: TAP network lockup after some traffic

Initial Comment:
Hi all,
we are experiencing severe network troubles using kvm+tap networking.
basically, after some network load (we have yet to identify the exact amount of 
traffic if one exist) network stops working. 
During lockups, With tcpdump we can see arp requests on guest interface, then 
on tap, brX and physical interfaces on host system.
the arp answers can be seen, with tcpdump, only on physical host interface and 
bridge (brX), but not on tap device. Basically it seems that packets coming 
from external network get lost in tap device on the way to guest (kvm). Looking 
at tap data with ifconfig, the only weird thing is the TX packets overrun count 
that is  0. every time the network stops working, overrun count increases.
This has been observed with several kvm releases (for sure, 
76/77/78/79/80/81/82) and with different kernels (tried with some versions 
among 2.6.25.X, 26.X, 27.X, 28) both on guest and host side.
we tried several network drivers (virtio, e1000, rtl) and all shows the same 
problem. Only 100Mbit drivers seems to be unaffected so far. (only virtio has 
acceptable performance)
(btw: on host machine we have vlan on top of ethX devices)
cpu number on guest makes no difference.
we tried with vanilla kernel provided kvm modules and with kvm package provided 
modules.
guest: 32 bit
host: 64 bit
host machine:

2 x Quad-Core AMD Opteron(tm) Processor 2352 16GB, gentoo.

Of course I can provide more details or perform other tests and try patches, if 
someone can give me some hints and advices they will be most welcome.

Thanks.




--

Comment By: Tais M. Hansen (mellen)
Date: 2009-05-07 13:29

Message:
Just had another stall. Different host, different guest. Just one guest on
that host system.

What information would help debug this system?

efer_reload0 0
exits 11147836817  5827
fpu_reload  31483547 1
halt_exits8041208101  1397
halt_wakeup   5069136218 0
host_state_reload 13598781750  1507
hypercalls5890581948 4
insn_emulation8114024171  4338
insn_emulation_fail0 0
invlpg 0 0
io_exits  5263494017 2
irq_exits  78104521153
irq_window 0 0
largepages 0 0
mmio_exits 61151 0
mmu_cache_miss 156941726 0
mmu_flooded158265518 0
mmu_pde_zapped  96062900 0
mmu_pte_updated   2302220251 4
mmu_pte_write 1759022770 4
mmu_recycled   0 0
mmu_shadow_zapped  184691156 0
pf_fixed  4270608276 0
pf_guest  1226004351 0
remote_tlb_flush   481003516 2
request_irq0 0
signal_exits   1 0
tlb_flush 453965557229


--

Comment By: Fabio Coatti (cova)
Date: 2009-05-07 10:51

Message:
We are still biten by this issue. I'm running out of ideas, but if someone
can give me some hints on how to track down the problem or at least collect
more information I'll try it.

Thanks.

--

Comment By: Tais M. Hansen (mellen)
Date: 2009-05-07 10:39

Message:
Just happened again. Seems like it stops generating interrupts on virtio
devices:

 10:   3001  51199   IO-APIC-fasteoi   virtio1, virtio2
 11:76948358468134   IO-APIC-fasteoi   virtio0

doing an /bin/ls froze the guest for a minute or so until a CTRL-C got
through. After /bin/ls:

 10:   3001  51220   IO-APIC-fasteoi   virtio1, virtio2
 11:76948358468134   IO-APIC-fasteoi   virtio0

kvm_stat:
efer_reload0 0
exits 19150288617  1082
fpu_reload 11976730134
halt_exits 544721221   194
halt_wakeup300187634   141
host_state_reload  837279034   259
hypercalls4991133618 0
insn_emulation4797940911   709
insn_emulation_fail0 0
invlpg 0 0

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jan Kiszka

Avi Kivity wrote:
 Jan Kiszka wrote:
 I agree, unless

 - we want to make gdbstub support configurable (don't see any
 overwhelming reason for this, but maybe others do)
 - we want to merge ia64 kvm support upstream, and don't want to impose
 gdbstub support (though I'd recommend properly implementing gdbstub)

 In any case, I'm okay with dropping the check upstream and applying the
 local fixup.
 

 The last ia64 tcg patch I saw one or two days ago was missing gdb
 support, too (and may suffer from some brokenness for that reason).
 
 I didn't see ia64/tcg.  Did you mean s390?  That was host only IIRC.

Ah, you are right.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix irq routing by moving it to the correct place

2009-05-07 Thread Avi Kivity


Avi Kivity wrote:

Jes Sorensen wrote:

Hi,

Some freak :-) put x86 specific irq routing into the generic code path,
which obviously won't work on non-x86 systems.

This patch creates kvm_arch_init_irq_routing() and moves the x86
gibberish to it's correct location :-)



Applied, thanks.



Argh, that didn't build on x86.  I fixed it up.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Matthew Wilcox

On Thu, May 07, 2009 at 01:44:38PM +0300, Avi Kivity wrote:
 I imagine this loop is present in many drivers.  So why not add a helper

Let's look!

arch/x86/kernel/amd_iommu_init.c : Needs an exact number of vectors.
drivers/block/cciss.c : If it doesn't get all 4 vectors, falls back to pin mode 
(instead of MSI mode -- bug!)
drivers/dma/ioat_dma.c : Falls back down to 1 vector if it can't get nvec
drivers/infiniband/hw/mthca/mthca_main.c : Reverts to MSI if it can't get 3.
drivers/net/benet/be_main.c : Falls back to MSI if it can't get 2.
drivers/net/bnx2.c : Falls back to MSI if it can't get 9.
drivers/net/bnx2x_main.c : Falls back to MSI if it can't get N.
drivers/net/e1000e/netdev.c: Falls back to MSI if it can't get N.
drivers/net/enic/enic_main.c: Falls back to MSI if it can't get N.
drivers/net/forcedeth.c: Falls back to MSI if it can't get N.
drivers/net/igb/igb_main.c: Falls back to MSI if it can't get N.
drivers/net/igbvf/netdev.c: Falls back to MSI if it can't get 3.
drivers/net/myri10ge/myri10ge.c: Falls back to Pin if it can't get N.
drivers/net/netxen/netxen_nic_main.c: Falls back to MSI if it can't get N.
drivers/net/qlge/qlge_main.c: Falls back to MSI if it can't get N.
drivers/net/s2io.c: Falls back to MSI if it can't get N.
drivers/net/vxge/vxge-main.c: Falls back once to a second number.
drivers/pci/pcie/portdrv_core.c: Falls back to MSI if it can't get all of them.
drivers/scsi/lpfc/lpfc_init.c: Falls back to MSI if it can't get N.
drivers/scsi/mpt2sas/mpt2sas_base.c: Only allocates 1.

drivers/net/cxgb3/cxgb3_main.c: Actually falls back ... in a bit of a weird 
way.  This one could definitely do with the proposed loop.
drivers/net/ixgbe/ixgbe_main.c: Could also be improved with this loop.
drivers/net/mlx4/main.c: Nasty goto-based loop.
drivers/net/niu.c: Ditto
drivers/net/sfc/efx.c: Only falls back once.  Would benefit from loop.
drivers/scsi/qla2xxx/qla_isr.c: Only falls back once.  Would benefit from loop.
drivers/staging/sxg/sxg.c: /*Should try with less vector returned.*/

so ... 7 drivers would benefit from this loop.  20 seem like they wouldn't.

What a lot of drivers seem to do is fall back either to a single MSI or
just pin-based interrupts when they can't get as many interrupts as they
would like.  They don't try to get a single MSI-X interrupt.  I feel an
update to the MSI-HOWTO coming on.

-- 
Matthew Wilcox  Intel Open Source Technology Centre
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] kvm: bios: Use a different mask to size the option ROM BAR

2009-05-07 Thread Avi Kivity


Alex Williamson wrote:

Bit 0 is the enable bit, which we not only don't want to set, but
it will stick and make us think it's an I/O port resource.

Signed-off-by: Alex Williamson alex.william...@hp.com
---

 kvm/bios/rombios32.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c
index 8684987..6502e63 100755
--- a/kvm/bios/rombios32.c
+++ b/kvm/bios/rombios32.c
@@ -958,11 +958,13 @@ static void pci_bios_init_device(PCIDevice *d)
 int ofs;
 uint32_t val, size ;
 
-if (i == PCI_ROM_SLOT)

+if (i == PCI_ROM_SLOT) {
 ofs = 0x30;
-else
+pci_config_writel(d, ofs, 0xfffe);
+} else {
 ofs = 0x10 + i * 4;
-pci_config_writel(d, ofs, 0x);
+pci_config_writel(d, ofs, 0x);
+}
 val = pci_config_readl(d, ofs);
 if (val != 0) {
 size = (~(val  ~0xf)) + 1;
  


Looks good.  I think it applies upstream (or even upstream's upstream) 
though, in which case please send it qemu-devel.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: Fix assigned device with no irq

2009-05-07 Thread Avi Kivity


Sheng Yang wrote:

Some device like VF of SRIOV only support MSI-X.

With this patch, SRIOV can be enabled with KVM assigned device(all kernel space
patches are ready).

  


Just noticed I missed this patch.  Is it still needed or is it obsolete?



diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4d2be16..ce80f3a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
kvm_free_irq_source_id(kvm, assigned_dev-irq_source_id);
assigned_dev-irq_source_id = -1;
 
-	if (!assigned_dev-irq_requested_type)

+   if (!assigned_dev-irq_requested_type) {
+   if (assigned_dev-dev-msi_enabled)
+   pci_disable_msi(assigned_dev-dev);
return;
+   }
 
 	/*

 * In kvm_free_device_irq, cancel_work_sync return true if:
@@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm *kvm,
if (adev-irq_requested_type  KVM_ASSIGNED_DEV_HOST_INTX)
return 0;
 
+	/* IRQ 0 means uninitialized here */

+   if (airq-host_irq == 0)
+   return 0;
+
if (irqchip_in_kernel(kvm)) {
if (!msi2intx 
(adev-irq_requested_type  KVM_ASSIGNED_DEV_HOST_MSI)) {
@@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm,
r = 0;
} else {
printk(KERN_WARNING
-  kvm: failed to enable MSI device!\n);
-   r = -ENOTTY;
-   goto out_release;
+  kvm: device didn't support INTx or MSI!\n);
+   r = 0;
}
} else {
/* Non-sharing INTx mode */
  



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: Fix assigned device with no irq

2009-05-07 Thread Marcelo Tosatti

On Thu, May 07, 2009 at 03:24:15PM +0300, Avi Kivity wrote:
 Sheng Yang wrote:
 Some device like VF of SRIOV only support MSI-X.

 With this patch, SRIOV can be enabled with KVM assigned device(all kernel 
 space
 patches are ready).

   

 Just noticed I missed this patch.  Is it still needed or is it obsolete?

I think its obsolete. 


 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 4d2be16..ce80f3a 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
  kvm_free_irq_source_id(kvm, assigned_dev-irq_source_id);
  assigned_dev-irq_source_id = -1;
  -   if (!assigned_dev-irq_requested_type)
 +if (!assigned_dev-irq_requested_type) {
 +if (assigned_dev-dev-msi_enabled)
 +pci_disable_msi(assigned_dev-dev);
  return;
 +}
  /*
   * In kvm_free_device_irq, cancel_work_sync return true if:
 @@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm *kvm,
  if (adev-irq_requested_type  KVM_ASSIGNED_DEV_HOST_INTX)
  return 0;
  +   /* IRQ 0 means uninitialized here */
 +if (airq-host_irq == 0)
 +return 0;
 +
  if (irqchip_in_kernel(kvm)) {
  if (!msi2intx 
  (adev-irq_requested_type  KVM_ASSIGNED_DEV_HOST_MSI)) {
 @@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm,
  r = 0;
  } else {
  printk(KERN_WARNING
 -   kvm: failed to enable MSI device!\n);
 -r = -ENOTTY;
 -goto out_release;
 +   kvm: device didn't support INTx or MSI!\n);
 +r = 0;
  }
  } else {
  /* Non-sharing INTx mode */
   


 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: Fix assigned device with no irq

2009-05-07 Thread Sheng Yang

On Thursday 07 May 2009 20:28:12 Marcelo Tosatti wrote:
 On Thu, May 07, 2009 at 03:24:15PM +0300, Avi Kivity wrote:
  Sheng Yang wrote:
  Some device like VF of SRIOV only support MSI-X.
 
  With this patch, SRIOV can be enabled with KVM assigned device(all
  kernel space patches are ready).
 
  Just noticed I missed this patch.  Is it still needed or is it obsolete?

 I think its obsolete.

Yeah, it have been fixed from another direct in userspace: Interrupt pin =0 
means it haven't been initialized.

-- 
regards
Yang, Sheng


  diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
  index 4d2be16..ce80f3a 100644
  --- a/virt/kvm/kvm_main.c
  +++ b/virt/kvm/kvm_main.c
  @@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
 kvm_free_irq_source_id(kvm, assigned_dev-irq_source_id);
 assigned_dev-irq_source_id = -1;
   - if (!assigned_dev-irq_requested_type)
  +  if (!assigned_dev-irq_requested_type) {
  +  if (assigned_dev-dev-msi_enabled)
  +  pci_disable_msi(assigned_dev-dev);
 return;
  +  }
 /*
  * In kvm_free_device_irq, cancel_work_sync return true if:
  @@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm
  *kvm, if (adev-irq_requested_type  KVM_ASSIGNED_DEV_HOST_INTX)
 return 0;
   + /* IRQ 0 means uninitialized here */
  +  if (airq-host_irq == 0)
  +  return 0;
  +
 if (irqchip_in_kernel(kvm)) {
 if (!msi2intx 
 (adev-irq_requested_type  KVM_ASSIGNED_DEV_HOST_MSI)) {
  @@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm,
 r = 0;
 } else {
 printk(KERN_WARNING
  - kvm: failed to enable MSI device!\n);
  -  r = -ENOTTY;
  -  goto out_release;
  + kvm: device didn't support INTx or MSI!\n);
  +  r = 0;
 }
 } else {
 /* Non-sharing INTx mode */
 
  --
  error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jes Sorensen


Avi Kivity wrote:

I agree, unless

- we want to make gdbstub support configurable (don't see any 
overwhelming reason for this, but maybe others do)
- we want to merge ia64 kvm support upstream, and don't want to impose 
gdbstub support (though I'd recommend properly implementing gdbstub)


In any case, I'm okay with dropping the check upstream and applying the 
local fixup.


Hi,

Here's a patch that fixes the #ifndef to make it the #ifdef as it was
intended.

I am quite fine with us trying to drop all the #ifdefs and introduce
noop wrappers for archs that do not provide the gdbstubs (ie. ia64), but
to start with we better just fix the #ifndef to make it behave like it
was originally intended.

Cheers,
Jes



Fix incorrect #ifndef for CONFIG_GETSTUB, which should have been an
#ifdef.

Signed-off-by: Jes Sorensen j...@sgi.com

---
 vl.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: qemu/vl.c
===
--- qemu.orig/vl.c
+++ qemu/vl.c
@@ -4350,7 +4350,7 @@
 }
 if (cpu_can_run(env))
 ret = qemu_cpu_exec(env);
-#ifndef CONFIG_GDBSTUB
+#ifdef CONFIG_GDBSTUB
 if (ret == EXCP_DEBUG) {
 gdb_set_stop_cpu(env);
 debug_requested = 1;

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jan Kiszka

Jes Sorensen wrote:
 Avi Kivity wrote:
 I agree, unless

 - we want to make gdbstub support configurable (don't see any
 overwhelming reason for this, but maybe others do)
 - we want to merge ia64 kvm support upstream, and don't want to impose
 gdbstub support (though I'd recommend properly implementing gdbstub)

 In any case, I'm okay with dropping the check upstream and applying
 the local fixup.
 
 Hi,
 
 Here's a patch that fixes the #ifndef to make it the #ifdef as it was
 intended.
 
 I am quite fine with us trying to drop all the #ifdefs and introduce
 noop wrappers for archs that do not provide the gdbstubs (ie. ia64), but
 to start with we better just fix the #ifndef to make it behave like it
 was originally intended.

As the original change this patch (correctly) fixes broke gdbstub
support in qemu-kvm for all archs + it is still a deviation from
upstream, please take the chance and go for ia64 stubs.

BTW, I'm planning to submit a cleanup patch for CONFIG_GDBSTUB to
upstream soon.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jes Sorensen


Jan Kiszka wrote:

As the original change this patch (correctly) fixes broke gdbstub
support in qemu-kvm for all archs + it is still a deviation from
upstream, please take the chance and go for ia64 stubs.


Jan,

There's far more to do in upstream than just gdbstubs for ia64 to be
usable :-(

We'll get to gdbstubs at some point, but there's too many other higher
priority items to deal with first.

Cheers,
Jes
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jan Kiszka

Jes Sorensen wrote:
 Jan Kiszka wrote:
 As the original change this patch (correctly) fixes broke gdbstub
 support in qemu-kvm for all archs + it is still a deviation from
 upstream, please take the chance and go for ia64 stubs.
 
 Jan,
 
 There's far more to do in upstream than just gdbstubs for ia64 to be
 usable :-(
 
 We'll get to gdbstubs at some point, but there's too many other higher
 priority items to deal with first.

I'm not asking for full gdbstub support at this point, just for a build
workaround at arch-level (or in arch-specific blocks, in gdbstub.c e.g.)
instead of patching generic code with global risk of regressions.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 0/4] use smp_send_reschedule in vcpu_kick / assigned dev host intx race fix

2009-05-07 Thread Avi Kivity


Sheng Yang wrote:

Is there any issue blocked this patchset?
  


Yes, a slow maintainer.  I'll go review it now.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/4] qemu: external module: smp_send_reschedule compat

2009-05-07 Thread Avi Kivity


mtosa...@redhat.com wrote:

smp_send_reschedule was exported (via smp_ops) in v2.6.24.
Create a compat function which schedules the IPI to keventd context, 
in case interrupts are disabled, for kernels  2.6.24.


 
 #endif
 
+#if LINUX_VERSION_CODE  KERNEL_VERSION(2,6,28)

+#define nr_cpu_ids NR_CPUS
+#endif
+
  


Doesn't seem related?  Please send as a separate patch.


diff --git a/kvm/kernel/x86/hack-module.awk b/kvm/kernel/x86/hack-module.awk
index 260eeef..f4d14da 100644
--- a/kvm/kernel/x86/hack-module.awk
+++ b/kvm/kernel/x86/hack-module.awk
@@ -35,6 +35,8 @@ BEGIN { split(INIT_WORK desc_struct ldttss_desc64 desc_ptr  
\
 
 { sub(/match-dev-msi_enabled/, kvm_pcidev_msi_enabled(match-dev)) }
 
+{ sub(/smp_send_reschedule/, kvm_smp_send_reschedule) }

+
 /^static void __vmx_load_host_state/ {
 vmx_load_host_state = 1
 }

  


There's a bit on the top of hack-module.awk that does this slightly 
simpler.  Please also adjust ia64.


This now resides in kvm-kmod.git, please patch against that.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jes Sorensen


Jan Kiszka wrote:

I'm not asking for full gdbstub support at this point, just for a build
workaround at arch-level (or in arch-specific blocks, in gdbstub.c e.g.)
instead of patching generic code with global risk of regressions.


Oh, I see.

I already submitted that to Avi, making gdb_set_stop_cpu() a noop on
ia64. Since there's no target-ia64/ directory in mainline qemu.git yet,
that fix will have to live in qemu-kvm for the time being.

I am fine if you wish to remove all the #ifdef CONFIG_GDBSTUB bits.

Cheers,
Jes

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Marcelo Tosatti

On Thu, May 07, 2009 at 12:48:21PM +0300, Avi Kivity wrote:
 Davide Libenzi wrote:
 On Wed, 6 May 2009, Gregory Haskins wrote:

   
 I think we are ok in this regard (at least in v5) without the 
 callback. kvm holds irqfd, which holds eventfd.  In a normal 
 situation, we will
 have eventfd with 2 references.  If userspace closes the eventfd, it
 will drop 1 of the 2 eventfd file references, but the object should
 remain intact as long as kvm still holds it as well.  When the kvm-fd is
 released, we will then decouple from the eventfd-wqh and drop the last
 fput(), officially freeing it.

 Likewise, if kvm is closed before the eventfd, we will simply decouple
 from the wqh and fput(eventfd), leaving the last reference held by
 userspace until it closes as well.

 Let me know if you see any holes in that.
 

 Looks OK, modulo my knowledge of KVM internals.
   

 What's your take on adding irq context safe callbacks to irqfd?

 To give some background here, we would like to use eventfd as a generic  
 connector between components, so the components do not know about each  
 other.  So far eventfd successfully abstracts among components in the  
 same process, in different processes, and in the kernel.

 eventfd_signal() can be safely called from irq context, and will wake up  
 a waiting task.  But in some cases, if the consumer is in the kernel, it  
 may be able to consume the event from irq context, saving a context 
 switch.

 So, will you consider patches adding this capability to eventfd?

(pasting from a separate thread)

 That's my thinking.  PCI interrupts don't work because we need to do  
 some hacky stuff in there, but MSI should.  Oh, and we could improve
 UIO  
 support for interrupts when using MSI, since there's no need to  
 acknowledge the interrupt.

Ok, so for INTx assigned devices all you need to do on the ACK handler
is to re-enable the host interrupt (and set the guest interrupt line to
low).

Right now the ack comes through a kvm internal irq ack callback.

AFAICS there is no mechanism in irqfd for ACK notification, and
interrupt injection is edge triggered.

So for PCI INTx assigned devices (or any INTx level), you'd want to keep
the guest interrupt high, with some way to notify the ACK.

Avi mentioned a separate irqfd to notify the ACK. For assigned devices,
you could register a fd wakeup function in that fd, which replaces the
current irq ACK callback?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Gregory Haskins

Marcelo Tosatti wrote:
 On Thu, May 07, 2009 at 12:48:21PM +0300, Avi Kivity wrote:
   
 Davide Libenzi wrote:
 
 On Wed, 6 May 2009, Gregory Haskins wrote:

   
   
 I think we are ok in this regard (at least in v5) without the 
 callback. kvm holds irqfd, which holds eventfd.  In a normal 
 situation, we will
 have eventfd with 2 references.  If userspace closes the eventfd, it
 will drop 1 of the 2 eventfd file references, but the object should
 remain intact as long as kvm still holds it as well.  When the kvm-fd is
 released, we will then decouple from the eventfd-wqh and drop the last
 fput(), officially freeing it.

 Likewise, if kvm is closed before the eventfd, we will simply decouple
 from the wqh and fput(eventfd), leaving the last reference held by
 userspace until it closes as well.

 Let me know if you see any holes in that.
 
 
 Looks OK, modulo my knowledge of KVM internals.
   
   
 What's your take on adding irq context safe callbacks to irqfd?

 To give some background here, we would like to use eventfd as a generic  
 connector between components, so the components do not know about each  
 other.  So far eventfd successfully abstracts among components in the  
 same process, in different processes, and in the kernel.

 eventfd_signal() can be safely called from irq context, and will wake up  
 a waiting task.  But in some cases, if the consumer is in the kernel, it  
 may be able to consume the event from irq context, saving a context 
 switch.

 So, will you consider patches adding this capability to eventfd?
 

 (pasting from a separate thread)

   
 That's my thinking.  PCI interrupts don't work because we need to do  
 some hacky stuff in there, but MSI should.  Oh, and we could improve
 UIO  
 support for interrupts when using MSI, since there's no need to  
 acknowledge the interrupt.
 

 Ok, so for INTx assigned devices all you need to do on the ACK handler
 is to re-enable the host interrupt (and set the guest interrupt line to
 low).

 Right now the ack comes through a kvm internal irq ack callback.

 AFAICS there is no mechanism in irqfd for ACK notification, and
 interrupt injection is edge triggered.

 So for PCI INTx assigned devices (or any INTx level), you'd want to keep
 the guest interrupt high, with some way to notify the ACK.

 Avi mentioned a separate irqfd to notify the ACK. For assigned devices,
 you could register a fd wakeup function in that fd, which replaces the
 current irq ACK callback?
   

One thing I was thinking here was that I could create a flag for the
kvm_irqfd() function for something like KVM_IRQFD_MODE_CLEAR.  This
flag when specified at creation time will cause the event to execute a
clear operation instead of a set when triggered.  That way, the default
mode is an edge-triggered set.  The non-default mode is to trigger a
clear.  Level-triggered ints could therefore create two irqfds, one for
raising, the other for clearing.

An alternative is to abandon the use of eventfd, and allow the irqfd to
be a first-class anon-fd.  The parameters passed to the write/signal()
function could then indicate the desired level.  The disadvantage would
be that it would not be compatible with eventfd, so we would need to
decide if the tradeoff is worth it.

OTOH, I suspect level triggered interrupts will be primarily in the
legacy domain, so perhaps we do not need to worry about it too much. 
Therefore, another option is that we *could* simply set the stake in the
ground that legacy/level cannot use irqfd.

Thoughts?

-Greg




signature.asc
Description: OpenPGP digital signature

[PATCHv2 0/3] virtio: add guest MSI-X support

2009-05-07 Thread Michael S. Tsirkin

Add optional MSI-X support: use a vector per virtqueue with
fallback to a common vector and finally to regular interrupt.
Teach all drivers to use it.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---

Here's a draft set of patches for MSI-X support in the guest.  It still
needs to be tested properly, and performance impact measured, but I
thought I'd share it here in the hope of getting some very early
feedback/flames.

Changelog since v1:
- Per Avi's suggestion, let guest configure virtqueue to vector mapping
- Per Rusty's suggestion, replace API with find_vqs/del_vqs.

Michael S. Tsirkin (3):
  virtio: find_vqs/del_vqs virtio operations
  virtio_pci: split up vp_interrupt
  virtio_pci: optional MSI-X support

 drivers/block/virtio_blk.c  |   11 +-
 drivers/char/hw_random/virtio-rng.c |   11 +-
 drivers/char/virtio_console.c   |   27 ++--
 drivers/lguest/lguest_device.c  |   49 ++-
 drivers/net/virtio_net.c|   47 +++
 drivers/s390/kvm/kvm_virtio.c   |   64 -
 drivers/virtio/virtio_balloon.c |   31 ++--
 drivers/virtio/virtio_pci.c |  283 ++-
 include/linux/virtio_config.h   |   29 +++-
 include/linux/virtio_pci.h  |8 +-
 10 files changed, 440 insertions(+), 120 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/3] virtio: find_vqs/del_vqs virtio operations

2009-05-07 Thread Michael S. Tsirkin

This replaces find_vq/del_vq with find_vqs/del_vqs virtio operations,
and updates all drivers. This is needed for MSI support, because MSI
needs to know the total number of vectors upfront.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/block/virtio_blk.c  |   11 +++---
 drivers/char/hw_random/virtio-rng.c |   11 +++---
 drivers/char/virtio_console.c   |   27 ++
 drivers/lguest/lguest_device.c  |   49 +-
 drivers/net/virtio_net.c|   47 +++---
 drivers/s390/kvm/kvm_virtio.c   |   64 +-
 drivers/virtio/virtio_balloon.c |   31 -
 drivers/virtio/virtio_pci.c |   43 +++
 include/linux/virtio_config.h   |   29 +++-
 9 files changed, 222 insertions(+), 90 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 5d34764..7b7435d 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -197,6 +197,7 @@ static int virtblk_probe(struct virtio_device *vdev)
u64 cap;
u32 v;
u32 blk_size, sg_elems;
+   virtqueue_callback *callback[] = { blk_done };
 
if (index_to_minor(index) = 1  MINORBITS)
return -ENOSPC;
@@ -224,11 +225,9 @@ static int virtblk_probe(struct virtio_device *vdev)
sg_init_table(vblk-sg, vblk-sg_elems);
 
/* We expect one virtqueue, for output. */
-   vblk-vq = vdev-config-find_vq(vdev, 0, blk_done);
-   if (IS_ERR(vblk-vq)) {
-   err = PTR_ERR(vblk-vq);
+   err = vdev-config-find_vqs(vdev, 1, vblk-vq, callback);
+   if (err)
goto out_free_vblk;
-   }
 
vblk-pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
if (!vblk-pool) {
@@ -323,7 +322,7 @@ out_put_disk:
 out_mempool:
mempool_destroy(vblk-pool);
 out_free_vq:
-   vdev-config-del_vq(vblk-vq);
+   vdev-config-del_vqs(vdev);
 out_free_vblk:
kfree(vblk);
 out:
@@ -344,7 +343,7 @@ static void virtblk_remove(struct virtio_device *vdev)
blk_cleanup_queue(vblk-disk-queue);
put_disk(vblk-disk);
mempool_destroy(vblk-pool);
-   vdev-config-del_vq(vblk-vq);
+   vdev-config-del_vqs(vdev);
kfree(vblk);
 }
 
diff --git a/drivers/char/hw_random/virtio-rng.c 
b/drivers/char/hw_random/virtio-rng.c
index 86e83f8..18eabe4 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -91,16 +91,17 @@ static struct hwrng virtio_hwrng = {
 
 static int virtrng_probe(struct virtio_device *vdev)
 {
+   virtqueue_callback *callback[] = { random_recv_done };
int err;
 
/* We expect a single virtqueue. */
-   vq = vdev-config-find_vq(vdev, 0, random_recv_done);
-   if (IS_ERR(vq))
-   return PTR_ERR(vq);
+   err = vdev-config-find_vqs(vdev, 1, vq, callback);
+   if (err)
+   return err;
 
err = hwrng_register(virtio_hwrng);
if (err) {
-   vdev-config-del_vq(vq);
+   vdev-config-del_vqs(vdev);
return err;
}
 
@@ -112,7 +113,7 @@ static void virtrng_remove(struct virtio_device *vdev)
 {
vdev-config-reset(vdev);
hwrng_unregister(virtio_hwrng);
-   vdev-config-del_vq(vq);
+   vdev-config-del_vqs(vdev);
 }
 
 static struct virtio_device_id id_table[] = {
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index ff6f5a4..1fd5376 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -188,6 +188,8 @@ static void hvc_handle_input(struct virtqueue *vq)
  * Finally we put our input buffer in the input queue, ready to receive. */
 static int __devinit virtcons_probe(struct virtio_device *dev)
 {
+   struct virtqueue *vqs[2];
+   virtqueue_callback *callbacks[2];
int err;
 
vdev = dev;
@@ -199,20 +201,17 @@ static int __devinit virtcons_probe(struct virtio_device 
*dev)
goto fail;
}
 
-   /* Find the input queue. */
+   /* Find the queues. */
/* FIXME: This is why we want to wean off hvc: we do nothing
 * when input comes in. */
-   in_vq = vdev-config-find_vq(vdev, 0, hvc_handle_input);
-   if (IS_ERR(in_vq)) {
-   err = PTR_ERR(in_vq);
+   callbacks[0] = hvc_handle_input;
+   callbacks[1] = NULL;
+   err = vdev-config-find_vqs(vdev, 2, vqs, callbacks);
+   if (err)
goto free;
-   }
 
-   out_vq = vdev-config-find_vq(vdev, 1, NULL);
-   if (IS_ERR(out_vq)) {
-   err = PTR_ERR(out_vq);
-   goto free_in_vq;
-   }
+   in_vq = vqs[0];
+   out_vq = vqs[1];
 
/* Start using the new console output. */
virtio_cons.get_chars = get_chars;
@@ -233,17 +232,15 @@ static int __devinit virtcons_probe(struct virtio_device 
*dev)

[PATCH 2/3] virtio_pci: split up vp_interrupt

2009-05-07 Thread Michael S. Tsirkin

This reorganizes virtio-pci code in vp_interrupt slightly, so that
it's easier to add per-vq MSI support on top.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_pci.c |   45 +-
 1 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 3671c42..f7b79a2 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -164,6 +164,37 @@ static void vp_notify(struct virtqueue *vq)
iowrite16(info-queue_index, vp_dev-ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }
 
+/* Handle a configuration change: Tell driver if it wants to know. */
+static irqreturn_t vp_config_changed(int irq, void *opaque)
+{
+   struct virtio_pci_device *vp_dev = opaque;
+   struct virtio_driver *drv;
+   drv = container_of(vp_dev-vdev.dev.driver,
+  struct virtio_driver, driver);
+
+   if (drv  drv-config_changed)
+   drv-config_changed(vp_dev-vdev);
+   return IRQ_HANDLED;
+}
+
+/* Notify all virtqueues on an interrupt. */
+static irqreturn_t vp_vring_interrupt(int irq, void *opaque)
+{
+   struct virtio_pci_device *vp_dev = opaque;
+   struct virtio_pci_vq_info *info;
+   irqreturn_t ret = IRQ_NONE;
+   unsigned long flags;
+
+   spin_lock_irqsave(vp_dev-lock, flags);
+   list_for_each_entry(info, vp_dev-virtqueues, node) {
+   if (vring_interrupt(irq, info-vq) == IRQ_HANDLED)
+   ret = IRQ_HANDLED;
+   }
+   spin_unlock_irqrestore(vp_dev-lock, flags);
+
+   return ret;
+}
+
 /* A small wrapper to also acknowledge the interrupt when it's handled.
  * I really need an EIO hook for the vring so I can ack the interrupt once we
  * know that we'll be handling the IRQ but before we invoke the callback since
@@ -173,9 +204,6 @@ static void vp_notify(struct virtqueue *vq)
 static irqreturn_t vp_interrupt(int irq, void *opaque)
 {
struct virtio_pci_device *vp_dev = opaque;
-   struct virtio_pci_vq_info *info;
-   irqreturn_t ret = IRQ_NONE;
-   unsigned long flags;
u8 isr;
 
/* reading the ISR has the effect of also clearing it so it's very
@@ -187,14 +215,11 @@ static irqreturn_t vp_interrupt(int irq, void *opaque)
return IRQ_NONE;
 
/* Configuration change?  Tell driver if it wants to know. */
-   if (isr  VIRTIO_PCI_ISR_CONFIG) {
-   struct virtio_driver *drv;
-   drv = container_of(vp_dev-vdev.dev.driver,
-  struct virtio_driver, driver);
+   if (isr  VIRTIO_PCI_ISR_CONFIG)
+   vp_config_changed(irq, opaque);
 
-   if (drv  drv-config_changed)
-   drv-config_changed(vp_dev-vdev);
-   }
+   return vp_vring_interrupt(irq, opaque);
+}
 
spin_lock_irqsave(vp_dev-lock, flags);
list_for_each_entry(info, vp_dev-virtqueues, node) {
-- 
1.6.3.rc3.1.g830204

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/3] virtio_pci: optional MSI-X support

2009-05-07 Thread Michael S. Tsirkin

This implements optional MSI-X support in virtio_pci.
MSI-X is used whenever the host supports at least 2 MSI-X
vectors: 1 for configuration changes and 1 for virtqueues.
Per-virtqueue vectors are allocated if enough vectors
available.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_pci.c |  209 +--
 include/linux/virtio_pci.h  |8 ++-
 2 files changed, 190 insertions(+), 27 deletions(-)

diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
index cd0fd5d..4a0275b 100644
--- a/include/linux/virtio_pci.h
+++ b/include/linux/virtio_pci.h
@@ -47,9 +47,15 @@
 /* The bit of the ISR which indicates a device configuration change. */
 #define VIRTIO_PCI_ISR_CONFIG  0x2
 
+/* MSI-X registers: only enabled if MSI-X is enabled. */
+/* A 16-bit vector for configuration changes. */
+#define VIRTIO_MSI_CONFIG_VECTOR20
+/* A 16-bit vector for selected queue notifications. */
+#define VIRTIO_MSI_QUEUE_VECTOR 22
+
 /* The remaining space is defined by each driver as the per-driver
  * configuration space */
-#define VIRTIO_PCI_CONFIG  20
+#define VIRTIO_PCI_CONFIG(dev) ((dev)-msix_enabled ? 24 : 20)
 
 /* Virtio ABI version, this must match exactly */
 #define VIRTIO_PCI_ABI_VERSION 0
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index f7b79a2..2b6333c 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -42,6 +42,28 @@ struct virtio_pci_device
/* a list of queues so we can dispatch IRQs */
spinlock_t lock;
struct list_head virtqueues;
+
+   /* MSI-X support */
+   int msix_enabled;
+   struct msix_entry *msix_entries;
+   /* Name strings for interrupts. This size should be enough,
+* and I'm too lazy to allocate each name separately. */
+   char (*msix_names)[256];
+   /* Number of vectors configured at startup (excludes per-virtqueue
+* vectors if any) */
+   unsigned msix_preset_vectors;
+   /* Number of per-virtqueue vectors if any. */
+   unsigned msix_per_vq_vectors;
+};
+
+/* Constants for MSI-X */
+/* Use first vector for configuration changes, second and the rest for
+ * virtqueues Thus, we need at least 2 vectors for MSI. */
+enum {
+   VP_MSIX_CONFIG_VECTOR = 0,
+   VP_MSIX_VQ_VECTOR = 1,
+   VP_MSIX_MIN_VECTORS = 2,
+   VP_MSIX_NO_VECTOR = 0x,
 };
 
 struct virtio_pci_vq_info
@@ -60,6 +82,9 @@ struct virtio_pci_vq_info
 
/* the list node for the virtqueues list */
struct list_head node;
+
+   /* MSI-X vector (or none) */
+   unsigned vector;
 };
 
 /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
@@ -109,7 +134,8 @@ static void vp_get(struct virtio_device *vdev, unsigned 
offset,
   void *buf, unsigned len)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
-   void __iomem *ioaddr = vp_dev-ioaddr + VIRTIO_PCI_CONFIG + offset;
+   void __iomem *ioaddr = vp_dev-ioaddr +
+   VIRTIO_PCI_CONFIG(vp_dev) + offset;
u8 *ptr = buf;
int i;
 
@@ -123,7 +149,8 @@ static void vp_set(struct virtio_device *vdev, unsigned 
offset,
   const void *buf, unsigned len)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
-   void __iomem *ioaddr = vp_dev-ioaddr + VIRTIO_PCI_CONFIG + offset;
+   void __iomem *ioaddr = vp_dev-ioaddr +
+  VIRTIO_PCI_CONFIG(vp_dev) + offset;
const u8 *ptr = buf;
int i;
 
@@ -221,17 +248,110 @@ static irqreturn_t vp_interrupt(int irq, void *opaque)
return vp_vring_interrupt(irq, opaque);
 }
 
-   spin_lock_irqsave(vp_dev-lock, flags);
-   list_for_each_entry(info, vp_dev-virtqueues, node) {
-   if (vring_interrupt(irq, info-vq) == IRQ_HANDLED)
-   ret = IRQ_HANDLED;
+/* the config-free_vqs() implementation */
+static void vp_free_vectors(struct virtio_device *vdev) {
+   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+   int i;
+
+   /* Disable the vector used for configuration */
+   iowrite16(VP_MSIX_NO_VECTOR,
+ vp_dev-ioaddr + VIRTIO_MSI_CONFIG_VECTOR);
+
+   for (i = 0; i  vp_dev-msix_preset_vectors; ++i)
+   free_irq(vp_dev-msix_entries[i].vector, vp_dev);
+
+   if (!vp_dev-msix_preset_vectors)
+   free_irq(vp_dev-pci_dev-irq, vp_dev);
+
+   if (vp_dev-msix_enabled) {
+   vp_dev-msix_enabled = 0;
+   pci_disable_msix(vp_dev-pci_dev);
}
-   spin_unlock_irqrestore(vp_dev-lock, flags);
+}
 
-   return ret;
+static int vp_request_vectors(struct virtio_device *vdev, unsigned max_vqs)
+{
+   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+   const char *name = dev_name(vp_dev-vdev.dev);
+   unsigned i, vectors;
+   int err =

Re: [patch 3/4] KVM: introduce kvm_arch_can_free_memslot, disallow slot deletion if cached cr3

2009-05-07 Thread Avi Kivity


mtosa...@redhat.com wrote:

Disallow the deletion of memory slots (and aliases, for x86 case), if a
vcpu contains a cr3 that points to such slot/alias.
  


That allows the guest to induce failures in the host.  Better to 
triple-fault the guest instead.


 
+int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)

+{
+   return 1;
+}
+
  


In general, instead of stubs in every arch, have x86 say 
KVM_HAVE_ARCH_CAN_FREE_MEMSLOT and define the stub in generic code when 
that define is not present.



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Avi Kivity


Gregory Haskins wrote:

One thing I was thinking here was that I could create a flag for the
kvm_irqfd() function for something like KVM_IRQFD_MODE_CLEAR.  This
flag when specified at creation time will cause the event to execute a
clear operation instead of a set when triggered.  That way, the default
mode is an edge-triggered set.  The non-default mode is to trigger a
clear.  Level-triggered ints could therefore create two irqfds, one for
raising, the other for clearing.
  


That's my second choice option.


An alternative is to abandon the use of eventfd, and allow the irqfd to
be a first-class anon-fd.  The parameters passed to the write/signal()
function could then indicate the desired level.  The disadvantage would
be that it would not be compatible with eventfd, so we would need to
decide if the tradeoff is worth it.
  


I would really like to keep using eventfd.  Which is why I asked Davide 
about the prospects of direct callbacks (vs wakeups).



OTOH, I suspect level triggered interrupts will be primarily in the
legacy domain, so perhaps we do not need to worry about it too much. 
Therefore, another option is that we *could* simply set the stake in the

ground that legacy/level cannot use irqfd.
  


This is my preferred option.  For a virtio-net-server in the kernel, 
we'd service its eventfd in qemu, raising and lowering the pci interrupt 
in the traditional way.


But we'd still need to know when to lower the interrupt.  How?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Davide Libenzi

On Thu, 7 May 2009, Avi Kivity wrote:

 What's your take on adding irq context safe callbacks to irqfd?
 
 To give some background here, we would like to use eventfd as a generic
 connector between components, so the components do not know about each other.
 So far eventfd successfully abstracts among components in the same process, in
 different processes, and in the kernel.
 
 eventfd_signal() can be safely called from irq context, and will wake up a
 waiting task.  But in some cases, if the consumer is in the kernel, it may be
 able to consume the event from irq context, saving a context switch.
 
 So, will you consider patches adding this capability to eventfd?

Maybe I got lost in the thread, but inside the kernel we have 
callback-based wakeup since long time. This is what epoll uses, when 
hooking into the file* f_op-poll() subsystem.
Did you mean something else?


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Gregory Haskins

Avi Kivity wrote:
 Gregory Haskins wrote:
 One thing I was thinking here was that I could create a flag for the
 kvm_irqfd() function for something like KVM_IRQFD_MODE_CLEAR.  This
 flag when specified at creation time will cause the event to execute a
 clear operation instead of a set when triggered.  That way, the default
 mode is an edge-triggered set.  The non-default mode is to trigger a
 clear.  Level-triggered ints could therefore create two irqfds, one for
 raising, the other for clearing.
   

 That's my second choice option.

 An alternative is to abandon the use of eventfd, and allow the irqfd to
 be a first-class anon-fd.  The parameters passed to the write/signal()
 function could then indicate the desired level.  The disadvantage would
 be that it would not be compatible with eventfd, so we would need to
 decide if the tradeoff is worth it.
   

 I would really like to keep using eventfd.  Which is why I asked
 Davide about the prospects of direct callbacks (vs wakeups).

I saw that request.  That would be ideal.


 OTOH, I suspect level triggered interrupts will be primarily in the
 legacy domain, so perhaps we do not need to worry about it too much.
 Therefore, another option is that we *could* simply set the stake in the
 ground that legacy/level cannot use irqfd.
   

 This is my preferred option.  For a virtio-net-server in the kernel,
 we'd service its eventfd in qemu, raising and lowering the pci
 interrupt in the traditional way.

 But we'd still need to know when to lower the interrupt.  How?

IIUC, isn't that  usually device/subsystem specific, and out of scope of
the GSI delivery vehicle?  For instance, most devices I have seen with
level ints have a register in their device register namespace for acking
the int.  As an aside, this is what causes some of the grief in dealing
with shared interrupts like KVM pass-through and/or threaded-isrs:  
There isn't a standardized way to ACK them.

You may also see some generalization of masking/acking in things like
the MSI-X table.  But again, this would be out of scope of the general
GSI delivery path IIUC.

I understand that there is a feedback mechanism in the ioapic model for
calling back on acknowledgment of the interrupt.  But I am not sure what
is how the real hardware works normally, and therefore I am not
convinced that is something we need to feed all the way back (i.e. via
irqfd or whatever).  In the interest of full disclosure, its been a few
years since I studied the xAPIC docs, so I might be out to lunch on that
assertion. ;)

-Greg





signature.asc
Description: OpenPGP digital signature

[PATCH 1/4] BIOS changes for configuring irq0-inti2 override (v2)

2009-05-07 Thread Beth Kon

These patches resolve the irq0-inti2 override issue, and get the hpet working
on kvm. 

They are dependent on Jes Sorensen's recent 0006-qemu-kvm-irq-routing.patch.

Override and HPET changes are sent as a series because HPET depends on the 
override. Win2k8 expects the HPET interrupt on inti2, regardless of whether 
an override exists in the BIOS. And the HPET spec states that in legacy mode, 
timer interrupt is on inti2.

The irq0-inti2 override will always be used unless the kernel cannot do irq 
routing (i.e., compatibility with old kernels). So if the kernel is capable, 
userspace sets up irq0-inti2 via the irq routing interface, and adds the 
irq0-inti2 override to the MADT interrupt source override table, 
and the mp table (for the no-acpi case).

A couple of months ago, Marcelo was seeing RHEL5 guests complain of invalid
checksum with these patches, but later he couldn't reproduce it, and I'm not 
seeing it now. While all guests still need to be fully tested, everything 
appears to be in order.  I've tested on win2k864, win2k832, RHEL5.3 32 bit, 
and ubuntu 8.10 64 bit. 

Signed-off-by: Beth Kon e...@us.ibm.com

diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c
index 8684987..07dda73 100755
--- a/kvm/bios/rombios32.c
+++ b/kvm/bios/rombios32.c
@@ -445,6 +445,9 @@ uint32_t cpuid_ext_features;
 unsigned long ram_size;
 uint64_t ram_end;
 uint8_t bios_uuid[16];
+#ifdef BX_QEMU
+uint8_t irq0_override;
+#endif
 #ifdef BX_USE_EBDA_TABLES
 unsigned long ebda_cur_addr;
 #endif
@@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val)
 #define QEMU_CFG_SIGNATURE  0x00
 #define QEMU_CFG_ID 0x01
 #define QEMU_CFG_UUID   0x02
+#define QEMU_CFG_IRQ0_OVERRIDE 0x0e
 
 int qemu_cfg_port;
 
@@ -518,6 +522,18 @@ void uuid_probe(void)
 memset(bios_uuid, 0, 16);
 }
 
+#ifdef BX_QEMU
+void irq0_override_probe(void)
+{
+if(qemu_cfg_port) {
+qemu_cfg_select(QEMU_CFG_IRQ0_OVERRIDE);
+qemu_cfg_read(irq0_override, 1);
+return;
+}
+memset(irq0_override, 0, 1);
+}
+#endif
+
 void cpu_probe(void)
 {
 uint32_t eax, ebx, ecx, edx;
@@ -1160,6 +1176,13 @@ static void mptable_init(void)
 
 /* irqs */
 for(i = 0; i  16; i++) {
+#ifdef BX_QEMU
+/* One entry per ioapic interrupt destination. Destination 2 is covered
+ * by irq0-inti2 override (i == 0). Source IRQ 2 is unused 
+ */
+if (irq0_override  i == 2)
+continue;
+#endif
 putb(q, 3); /* entry type = I/O interrupt */
 putb(q, 0); /* interrupt type = vectored interrupt */
 putb(q, 0); /* flags: po=0, el=0 */
@@ -1167,7 +1190,12 @@ static void mptable_init(void)
 putb(q, 0); /* source bus ID = ISA */
 putb(q, i); /* source bus IRQ */
 putb(q, ioapic_id); /* dest I/O APIC ID */
-putb(q, i); /* dest I/O APIC interrupt in */
+#ifdef BX_QEMU
+if (irq0_override  i == 0)
+putb(q, 2); /* dest I/O APIC interrupt in */
+else
+#endif
+putb(q, i); /* dest I/O APIC interrupt in */
 }
 /* patch length */
 len = q - mp_config_table;
@@ -1550,16 +1578,18 @@ void acpi_bios_init(void)
 
 addr = (addr + 7)  ~7;
 madt_addr = addr;
+madt = (void *)(addr);
 madt_size = sizeof(*madt) +
 sizeof(struct madt_processor_apic) * MAX_CPUS +
-#ifdef BX_QEMU
-sizeof(struct madt_io_apic) /* + sizeof(struct madt_int_override) */;
-#else
 sizeof(struct madt_io_apic);
+#ifdef BX_QEMU
+for (i = 0; i  16; i++)
+if (PCI_ISA_IRQ_MASK  (1U  i))
+madt_size += sizeof(struct madt_int_override);
+if (irq0_override)
+madt_size += sizeof(struct madt_int_override);
 #endif
-madt = (void *)(addr);
 addr += madt_size;
-
 #ifdef BX_QEMU
 #ifdef HPET_WORKS_IN_KVM
 addr = (addr + 7)  ~7;
@@ -1660,23 +1690,20 @@ void acpi_bios_init(void)
 io_apic-io_apic_id = smp_cpus;
 io_apic-address = cpu_to_le32(0xfec0);
 io_apic-interrupt = cpu_to_le32(0);
+int_override = (struct madt_int_override*)(io_apic + 1);
 #ifdef BX_QEMU
-#ifdef HPET_WORKS_IN_KVM
-io_apic++;
-
-int_override = (void *)io_apic;
-int_override-type = APIC_XRUPT_OVERRIDE;
-int_override-length = sizeof(*int_override);
-int_override-bus = cpu_to_le32(0);
-int_override-source = cpu_to_le32(0);
-int_override-gsi = cpu_to_le32(2);
-int_override-flags = cpu_to_le32(0);
-#endif
+if (irq0_override) {
+memset(int_override, 0, sizeof(*int_override));
+int_override-type = APIC_XRUPT_OVERRIDE;
+int_override-length = sizeof(*int_override);
+int_override-source = 0;
+int_override-gsi = 2;
+int_override-flags = 0; /* conforms to bus specifications */
+int_override++;
+}
 #endif
-
-int_override = (struct madt_int_override*)(io_apic + 1);
 for ( i = 0; i  16; i++ ) {

[PATCH 3/4] BIOS changes for KVM HPET (v2)

2009-05-07 Thread Beth Kon


Just a note here... 
The number of table_offset_entry entries for the non BX_QEMU case doesn't make  
sense here. There are only 2 entries. I left it as is, since it does not 
impact HPET's interraction with it. Actually it seems like dead code since 
this is in kvm code but with BX_QEMU undefined. It doesn't seem to be a
problem. 


Signed-off-by: Beth Kon e...@us.ibm.com


diff --git a/kvm/bios/acpi-dsdt.dsl b/kvm/bios/acpi-dsdt.dsl
index c756fed..0e142be 100755
--- a/kvm/bios/acpi-dsdt.dsl
+++ b/kvm/bios/acpi-dsdt.dsl
@@ -308,7 +308,6 @@ DefinitionBlock (
 })
 }
 #ifdef BX_QEMU
-#ifdef HPET_WORKS_IN_KVM
 Device(HPET) {
 Name(_HID,  EISAID(PNP0103))
 Name(_UID, 0)
@@ -328,7 +327,6 @@ DefinitionBlock (
 })
 }
 #endif
-#endif
 }
 
 Scope(\_SB.PCI0) {
diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c
index ddfa828..7441cd7 100755
--- a/kvm/bios/rombios32.c
+++ b/kvm/bios/rombios32.c
@@ -1293,7 +1293,7 @@ struct rsdt_descriptor_rev1
 {
ACPI_TABLE_HEADER_DEF   /* ACPI common table 
header */
 #ifdef BX_QEMU
-   uint32_t table_offset_entry [2]; /* Array 
of pointers to other */
+   uint32_t table_offset_entry [3]; /* Array 
of pointers to other */
 // uint32_t table_offset_entry [4]; /* Array 
of pointers to other */
 #else
uint32_t table_offset_entry [3]; /* Array 
of pointers to other */
@@ -1450,8 +1450,8 @@ struct acpi_20_generic_address {
 } __attribute__((__packed__));
 
 /*
- *  * HPET Description Table
- *   */
+ *  HPET Description Table
+ */
 struct acpi_20_hpet {
 ACPI_TABLE_HEADER_DEF   /* ACPI common table 
header */
 uint32_t   timer_block_id;
@@ -1591,13 +1591,11 @@ void acpi_bios_init(void)
 #endif
 addr += madt_size;
 #ifdef BX_QEMU
-#ifdef HPET_WORKS_IN_KVM
 addr = (addr + 7)  ~7;
 hpet_addr = addr;
 hpet = (void *)(addr);
 addr += sizeof(*hpet);
 #endif
-#endif
 
 acpi_tables_size = addr - base_addr;
 
@@ -1620,10 +1618,10 @@ void acpi_bios_init(void)
 memset(rsdt, 0, sizeof(*rsdt));
 rsdt-table_offset_entry[0] = cpu_to_le32(fadt_addr);
 rsdt-table_offset_entry[1] = cpu_to_le32(madt_addr);
-//rsdt-table_offset_entry[2] = cpu_to_le32(ssdt_addr);
 #ifdef BX_QEMU
-//rsdt-table_offset_entry[3] = cpu_to_le32(hpet_addr);
+rsdt-table_offset_entry[2] = cpu_to_le32(hpet_addr);
 #endif
+//rsdt-table_offset_entry[3] = cpu_to_le32(ssdt_addr);
 acpi_build_table_header((struct acpi_table_header *)rsdt,
 RSDT, sizeof(*rsdt), 1);
 
@@ -1723,7 +1721,6 @@ void acpi_bios_init(void)
 
 #ifdef BX_QEMU
 /* HPET */
-#ifdef HPET_WORKS_IN_KVM
 memset(hpet, 0, sizeof(*hpet));
 /* Note timer_block_id value must be kept in sync with value advertised by
  * emulated hpet
@@ -1733,7 +1730,6 @@ void acpi_bios_init(void)
 acpi_build_table_header((struct  acpi_table_header *)hpet,
  HPET, sizeof(*hpet), 1);
 #endif
-#endif
 
 }
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/4] Userspace changes for KVM HPET (v2)

2009-05-07 Thread Beth Kon


Signed-off-by: Beth Kon e...@us.ibm.com


diff --git a/hw/hpet.c b/hw/hpet.c
index c7945ec..47c9f89 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -30,6 +30,7 @@
 #include console.h
 #include qemu-timer.h
 #include hpet_emul.h
+#include qemu-kvm.h
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -48,6 +49,43 @@ uint32_t hpet_in_legacy_mode(void)
 return 0;
 }
 
+static void hpet_kpit_enable(void)
+{
+struct kvm_pit_state ps;
+kvm_get_pit(kvm_context, ps);
+kvm_set_pit(kvm_context, ps);
+}
+
+static void hpet_kpit_disable(void)
+{
+struct kvm_pit_state ps;
+kvm_get_pit(kvm_context, ps);
+ps.channels[0].mode = 0xff;
+kvm_set_pit(kvm_context, ps);
+}
+
+static void hpet_legacy_enable(void)
+{
+if (qemu_kvm_pit_in_kernel()) {
+   hpet_kpit_disable();
+   dprintf(qemu: hpet disabled kernel pit\n);
+} else {
+   hpet_pit_disable();
+   dprintf(qemu: hpet disabled userspace pit\n);
+}
+}
+
+static void hpet_legacy_disable(void)
+{
+if (qemu_kvm_pit_in_kernel()) {
+   hpet_kpit_enable();
+   dprintf(qemu: hpet enabled kernel pit\n);
+} else {
+   hpet_pit_enable();
+   dprintf(qemu: hpet enabled userspace pit\n);
+}
+}
+
 static uint32_t timer_int_route(struct HPETTimer *timer)
 {
 uint32_t route;
@@ -475,9 +513,9 @@ static void hpet_ram_writel(void *opaque, 
target_phys_addr_t addr,
 }
 /* i8254 and RTC are disabled when HPET is in legacy mode */
 if (activating_bit(old_val, new_val, HPET_CFG_LEGACY)) {
-hpet_pit_disable();
+hpet_legacy_enable();
 } else if (deactivating_bit(old_val, new_val, 
HPET_CFG_LEGACY)) {
-hpet_pit_enable();
+hpet_legacy_disable();
 }
 break;
 case HPET_CFG + 4:
@@ -560,7 +598,7 @@ static void hpet_reset(void *opaque) {
  * hpet_reset is called due to system reset. At this point control must
  * be returned to pit until SW reenables hpet.
  */
-hpet_pit_enable();
+hpet_legacy_disable();
 count = 1;
 }
 
diff --git a/vl.c b/vl.c
index f9a72b3..b860b82 100644
--- a/vl.c
+++ b/vl.c
@@ -6138,10 +6138,15 @@ int main(int argc, char **argv, char **envp)
 }
 
 if (kvm_enabled()) {
-   kvm_init_ap();
+kvm_init_ap();
 #ifdef USE_KVM
 if (kvm_irqchip  !qemu_kvm_has_gsi_routing()) {
 irq0override = 0;
+/* if kernel can't do irq routing, interrupt source
+ * override 0-2 can not be set up as required by hpet,
+ * so disable hpet.
+ */
+no_hpet=1;
 }
 #endif
 }
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH][KVM][retry 1] Add support for Pause Filtering to AMD SVM

2009-05-07 Thread Mark Langsdorf

commit 01813db8627e74018c8cec90df7e345839351f23
Author: root r...@xendinar01.amd.com
Date:   Thu May 7 09:44:10 2009 -0500

New AMD processors will support the Pause Filter Feature.
This feature creates a new field in the VMCB called Pause
Filter Count.  If Pause Filter Count is greater than 0 and
intercepting PAUSEs is enabled, the processor will increment
an internal counter when a PAUSE instruction occurs instead
of intercepting.  When the internal counter reaches the
Pause Filter Count value, a PAUSE intercept will occur.

This feature can be used to detect contended spinlocks,
especially when the lock holding VCPU is not scheduled.
Rescheduling another VCPU prevents the VCPU seeking the
lock from wasting its quantum by spinning idly.

Experimental results show that most spinlocks are held
for less than 1000 PAUSE cycles or more than a few
thousand.  Default the Pause Filter Counter to 3000 to
detect the contended spinlocks.

Processor support for this feature is indicated by a CPUID
bit.

On a 24 core system running 4 guests each with 16 VCPUs,
this patch improved overall performance of each guest's
32 job kernbench by approximately 1%.  Further performance
improvement may be possible with a more sophisticated
yield algorithm.

-Mark Langsdorf
Operating System Research Center
AMD

Signed-off-by: Mark Langsdorf mark.langsd...@amd.com

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 85574b7..1fecb7e 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -57,7 +57,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u16 intercept_dr_write;
u32 intercept_exceptions;
u64 intercept;
-   u8 reserved_1[44];
+   u8 reserved_1[42];
+   u16 pause_filter_count;
u64 iopm_base_pa;
u64 msrpm_base_pa;
u64 tsc_offset;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ef43a18..4279141 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -45,6 +45,7 @@ MODULE_LICENSE(GPL);
 #define SVM_FEATURE_NPT  (1  0)
 #define SVM_FEATURE_LBRV (1  1)
 #define SVM_FEATURE_SVML (1  2)
+#define SVM_FEATURE_PAUSE_FILTER (1  10)
 
 #define DEBUGCTL_RESERVED_BITS (~(0x3fULL))
 
@@ -575,6 +576,12 @@ static void init_vmcb(struct vcpu_svm *svm)
 
svm-nested_vmcb = 0;
svm-vcpu.arch.hflags = HF_GIF_MASK;
+
+   if (svm_has(SVM_FEATURE_PAUSE_FILTER)) {
+   control-pause_filter_count = 5000;
+   control-intercept |= (1ULL  INTERCEPT_PAUSE);
+   }
+
 }
 
 static int svm_vcpu_reset(struct kvm_vcpu *vcpu)
@@ -2087,6 +2094,15 @@ static int interrupt_window_interception(struct vcpu_svm 
*svm,
return 1;
 }
 
+static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+{
+   /* Simple yield */
+   vcpu_put(svm-vcpu);
+   schedule();
+   vcpu_load(svm-vcpu);
+   return 1;
+}
+
 static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
  struct kvm_run *kvm_run) = {
[SVM_EXIT_READ_CR0] = emulate_on_interception,
@@ -2123,6 +2139,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
[SVM_EXIT_CPUID]= cpuid_interception,
[SVM_EXIT_IRET] = iret_interception,
[SVM_EXIT_INVD] = emulate_on_interception,
+   [SVM_EXIT_PAUSE]= pause_interception,
[SVM_EXIT_HLT]  = halt_interception,
[SVM_EXIT_INVLPG]   = invlpg_interception,
[SVM_EXIT_INVLPGA]  = invalid_op_interception,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2b73e19..e2b730d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -710,6 +710,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_load(vcpu, cpu);
put_cpu();
 }
+EXPORT_SYMBOL_GPL(vcpu_load);
 
 void vcpu_put(struct kvm_vcpu *vcpu)
 {
@@ -719,6 +720,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
preempt_enable();
mutex_unlock(vcpu-mutex);
 }
+EXPORT_SYMBOL_GPL(vcpu_put);
 
 static void ack_flush(void *_completed)
 {

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/4] Userspace changes for configuring irq0-inti2 override (v2)

2009-05-07 Thread Beth Kon

Signed-off-by: Beth Kon e...@us.ibm.com

diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c
index e1b19d7..bb74f38 100644
--- a/hw/fw_cfg.c
+++ b/hw/fw_cfg.c
@@ -279,6 +279,7 @@ void *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
 fw_cfg_add_bytes(s, FW_CFG_UUID, qemu_uuid, 16);
 fw_cfg_add_i16(s, FW_CFG_NOGRAPHIC, (uint16_t)nographic);
 fw_cfg_add_i16(s, FW_CFG_NB_CPUS, (uint16_t)smp_cpus);
+fw_cfg_add_i16(s, FW_CFG_IRQ0_OVERRIDE, (uint16_t)irq0override);
 
 register_savevm(fw_cfg, -1, 1, fw_cfg_save, fw_cfg_load, s);
 qemu_register_reset(fw_cfg_reset, s);
diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h
index f616ed2..498c1e3 100644
--- a/hw/fw_cfg.h
+++ b/hw/fw_cfg.h
@@ -15,6 +15,7 @@
 #define FW_CFG_INITRD_SIZE  0x0b
 #define FW_CFG_BOOT_DEVICE  0x0c
 #define FW_CFG_NUMA 0x0d
+#define FW_CFG_IRQ0_OVERRIDE0x0e
 #define FW_CFG_MAX_ENTRY0x10
 
 #define FW_CFG_WRITE_CHANNEL0x4000
diff --git a/hw/ioapic.c b/hw/ioapic.c
index 0b70cf6..2d77a2c 100644
--- a/hw/ioapic.c
+++ b/hw/ioapic.c
@@ -23,6 +23,7 @@
 
 #include hw.h
 #include pc.h
+#include sysemu.h
 #include qemu-timer.h
 #include host-utils.h
 
@@ -95,14 +96,13 @@ void ioapic_set_irq(void *opaque, int vector, int level)
 {
 IOAPICState *s = opaque;
 
-#if 0
 /* ISA IRQs map to GSI 1-1 except for IRQ0 which maps
  * to GSI 2.  GSI maps to ioapic 1-1.  This is not
  * the cleanest way of doing it but it should work. */
 
-if (vector == 0)
+if (vector == 0  irq0override) {
 vector = 2;
-#endif
+}
 
 if (vector = 0  vector  IOAPIC_NUM_PINS) {
 uint32_t mask = 1  vector;
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 8cb6faa..2e52c87 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -879,7 +879,11 @@ int kvm_arch_init_irq_routing(void)
 return r;
 }
 for (i = 0; i  24; ++i) {
-r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, i);
+if (i == 0) {
+r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, 2);
+} else if (i != 2) {
+r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, i);
+}
 if (r  0)
 return r;
 }
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 8226001..c64718c 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -165,6 +165,7 @@ void qemu_kvm_cpu_stop(CPUState *env);
 #define kvm_enabled() (kvm_allowed)
 #define qemu_kvm_irqchip_in_kernel() kvm_irqchip_in_kernel(kvm_context)
 #define qemu_kvm_pit_in_kernel() kvm_pit_in_kernel(kvm_context)
+#define qemu_kvm_has_gsi_routing() kvm_has_gsi_routing(kvm_context)
 #define kvm_has_sync_mmu() qemu_kvm_has_sync_mmu()
 void kvm_init_vcpu(CPUState *env);
 void kvm_load_tsc(CPUState *env);
@@ -173,6 +174,7 @@ void kvm_load_tsc(CPUState *env);
 #define kvm_nested 0
 #define qemu_kvm_irqchip_in_kernel() (0)
 #define qemu_kvm_pit_in_kernel() (0)
+#define qemu_kvm_has_gsi_routing() (0)
 #define kvm_has_sync_mmu() (0)
 #define kvm_load_registers(env) do {} while(0)
 #define kvm_save_registers(env) do {} while(0)
diff --git a/sysemu.h b/sysemu.h
index 1f45fd6..292bbc3 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -93,6 +93,7 @@ extern int graphic_width;
 extern int graphic_height;
 extern int graphic_depth;
 extern int nographic;
+extern int irq0override;
 extern const char *keyboard_layout;
 extern int win2k_install_hack;
 extern int rtc_td_hack;
diff --git a/vl.c b/vl.c
index 6b4b7d2..f9a72b3 100644
--- a/vl.c
+++ b/vl.c
@@ -207,6 +207,7 @@ static int vga_ram_size;
 enum vga_retrace_method vga_retrace_method = VGA_RETRACE_DUMB;
 static DisplayState *display_state;
 int nographic;
+int irq0override;
 static int curses;
 static int sdl;
 const char* keyboard_layout = NULL;
@@ -5039,6 +5040,7 @@ int main(int argc, char **argv, char **envp)
 vga_ram_size = VGA_RAM_SIZE;
 snapshot = 0;
 nographic = 0;
+irq0override = 1;
 curses = 0;
 kernel_filename = NULL;
 kernel_cmdline = ;
@@ -6135,8 +6137,14 @@ int main(int argc, char **argv, char **envp)
 }
 }
 
-if (kvm_enabled())
-   kvm_init_ap();
+if (kvm_enabled()) {
+   kvm_init_ap();
+#ifdef USE_KVM
+if (kvm_irqchip  !qemu_kvm_has_gsi_routing()) {
+irq0override = 0;
+}
+#endif
+}
 
 machine-init(ram_size, vga_ram_size, boot_devices,
   kernel_filename, kernel_cmdline, initrd_filename, cpu_model);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Avi Kivity


Gregory Haskins wrote:

 


This is my preferred option.  For a virtio-net-server in the kernel,
we'd service its eventfd in qemu, raising and lowering the pci
interrupt in the traditional way.

But we'd still need to know when to lower the interrupt.  How?



IIUC, isn't that  usually device/subsystem specific, and out of scope of
the GSI delivery vehicle?  For instance, most devices I have seen with
level ints have a register in their device register namespace for acking
the int.  


Yes it is.


As an aside, this is what causes some of the grief in dealing
with shared interrupts like KVM pass-through and/or threaded-isrs:  
There isn't a standardized way to ACK them.
  


So we'd need a side channel to tell userspace to lower the irq.  Another 
eventfd likely.


Note we don't support device assignment for devices with shared interrupts.


You may also see some generalization of masking/acking in things like
the MSI-X table.  But again, this would be out of scope of the general
GSI delivery path IIUC.

I understand that there is a feedback mechanism in the ioapic model for
calling back on acknowledgment of the interrupt.  But I am not sure what
is how the real hardware works normally, and therefore I am not
convinced that is something we need to feed all the way back (i.e. via
irqfd or whatever).  In the interest of full disclosure, its been a few
years since I studied the xAPIC docs, so I might be out to lunch on that
assertion. ;)
  


Right, that ack thing is completely internal, used for catching up on 
time drift, and for shutting down level triggered assigned interrupts.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH][KVM][retry 1] Add support for Pause Filtering to AMD SVM

2009-05-07 Thread Avi Kivity


(copying Ingo)

Mark Langsdorf wrote:

commit 01813db8627e74018c8cec90df7e345839351f23
Author: root r...@xendinar01.amd.com
Date:   Thu May 7 09:44:10 2009 -0500

New AMD processors will support the Pause Filter Feature.
This feature creates a new field in the VMCB called Pause
Filter Count.  If Pause Filter Count is greater than 0 and
intercepting PAUSEs is enabled, the processor will increment
an internal counter when a PAUSE instruction occurs instead
of intercepting.  When the internal counter reaches the
Pause Filter Count value, a PAUSE intercept will occur.

This feature can be used to detect contended spinlocks,

especially when the lock holding VCPU is not scheduled.
Rescheduling another VCPU prevents the VCPU seeking the
lock from wasting its quantum by spinning idly.

Experimental results show that most spinlocks are held

for less than 1000 PAUSE cycles or more than a few
thousand.  Default the Pause Filter Counter to 3000 to
detect the contended spinlocks.

Processor support for this feature is indicated by a CPUID

bit.

On a 24 core system running 4 guests each with 16 VCPUs,

this patch improved overall performance of each guest's
32 job kernbench by approximately 1%.  Further performance
improvement may be possible with a more sophisticated
yield algorithm.

-Mark Langsdorf

Operating System Research Center
AMD

Signed-off-by: Mark Langsdorf mark.langsd...@amd.com
  


(please use git format-patch rather than git show, and set up user.name 
and user.email properly)


 
 	svm-nested_vmcb = 0;

svm-vcpu.arch.hflags = HF_GIF_MASK;
+
+   if (svm_has(SVM_FEATURE_PAUSE_FILTER)) {
+   control-pause_filter_count = 5000;
+   control-intercept |= (1ULL  INTERCEPT_PAUSE);
+   }
+
 }


3000 or 5000?

 
+static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)

+{
+   /* Simple yield */
+   vcpu_put(svm-vcpu);
+   schedule();
+   vcpu_load(svm-vcpu);
+   return 1;
+


Ingo, will this do anything under CFS, or will CFS note that nothing has 
changed in the accounting and reschedule us immediately?




--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/4] BIOS changes for configuring irq0-inti2 override (v2)

2009-05-07 Thread Avi Kivity


Beth Kon wrote:

These patches resolve the irq0-inti2 override issue, and get the hpet working
on kvm. 


They are dependent on Jes Sorensen's recent 0006-qemu-kvm-irq-routing.patch.

Override and HPET changes are sent as a series because HPET depends on the 
override. Win2k8 expects the HPET interrupt on inti2, regardless of whether 
an override exists in the BIOS. And the HPET spec states that in legacy mode, 
timer interrupt is on inti2.


The irq0-inti2 override will always be used unless the kernel cannot do irq 
routing (i.e., compatibility with old kernels). So if the kernel is capable, 
userspace sets up irq0-inti2 via the irq routing interface, and adds the 
irq0-inti2 override to the MADT interrupt source override table, 
and the mp table (for the no-acpi case).


A couple of months ago, Marcelo was seeing RHEL5 guests complain of invalid
checksum with these patches, but later he couldn't reproduce it, and I'm not 
seeing it now. While all guests still need to be fully tested, everything 
appears to be in order.  I've tested on win2k864, win2k832, RHEL5.3 32 bit, 
and ubuntu 8.10 64 bit. 

  


What are the changes relative to v1?


@@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val)
 #define QEMU_CFG_SIGNATURE  0x00
 #define QEMU_CFG_ID 0x01
 #define QEMU_CFG_UUID   0x02
+#define QEMU_CFG_IRQ0_OVERRIDE 0x0e
  


As noted, this should be in the arch local space.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2] Shared memory device with interrupt support

2009-05-07 Thread Cam Macdonell

Support an inter-vm shared memory device that maps a shared-memory object 
as a PCI device in the guest.  This patch also supports interrupts between 
guest by communicating over a unix domain socket.  This patch applies to the 
qemu-kvm repository. 

This device now creates a qemu character device and sends 1-bytes messages to 
trigger interrupts.  Writes are trigger by writing to the Doorbell register 
on the shared memory PCI device.  The lower 8-bits of the value written to this 
register are sent as the 1-byte message so different meanings of interrupts can 
be supported.

Interrupts are only supported between 2 VMs currently.  One VM must act as the 
server by adding server to the command-line argument.  Shared memory devices 
are created with the following command-line:

-ivhshmem shm object,size in MB,[unix:path][,server] 

Interrupts can also be used between host and guest as well by implementing a 
listener on the host.

Cam

---
 Makefile.target |3 +
 hw/ivshmem.c|  421 +++
 hw/pc.c |6 +
 hw/pc.h |3 +
 qemu-options.hx |   14 ++
 sysemu.h|8 +
 vl.c|   14 ++
 7 files changed, 469 insertions(+), 0 deletions(-)
 create mode 100644 hw/ivshmem.c

diff --git a/Makefile.target b/Makefile.target
index b68a689..3190bba 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -643,6 +643,9 @@ OBJS += pcnet.o
 OBJS += rtl8139.o
 OBJS += e1000.o
 
+# Inter-VM PCI shared memory
+OBJS += ivshmem.o
+
 # Generic watchdog support and some watchdog devices
 OBJS += watchdog.o
 OBJS += wdt_ib700.o wdt_i6300esb.o
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
new file mode 100644
index 000..95e2268
--- /dev/null
+++ b/hw/ivshmem.c
@@ -0,0 +1,421 @@
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *  Cam Macdonell c...@cs.ualberta.ca
+ *
+ * Based On: cirrus_vga.c and rtl8139.c
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+
+#include hw.h
+#include console.h
+#include pc.h
+#include pci.h
+#include sysemu.h
+
+#include qemu-common.h
+#include sys/mman.h
+
+#define PCI_COMMAND_IOACCESS0x0001
+#define PCI_COMMAND_MEMACCESS   0x0002
+#define PCI_COMMAND_BUSMASTER   0x0004
+
+//#define DEBUG_IVSHMEM
+
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)\
+do {printf(IVSHMEM:  fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+typedef struct IVShmemState {
+uint16_t intrmask;
+uint16_t intrstatus;
+uint16_t doorbell;
+uint8_t *ivshmem_ptr;
+unsigned long ivshmem_offset;
+unsigned int ivshmem_size;
+unsigned long bios_offset;
+unsigned int bios_size;
+target_phys_addr_t base_ctrl;
+int it_shift;
+PCIDevice *pci_dev;
+CharDriverState * chr;
+unsigned long map_addr;
+unsigned long map_end;
+int ivshmem_mmio_io_addr;
+} IVShmemState;
+
+typedef struct PCI_IVShmemState {
+PCIDevice dev;
+IVShmemState ivshmem_state;
+} PCI_IVShmemState;
+
+typedef struct IVShmemDesc {
+char name[1024];
+char * chrdev;
+int size;
+} IVShmemDesc;
+
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+IntrMask = 0,
+IntrStatus = 16,
+Doorbell = 32
+};
+
+static int num_ivshmem_devices = 0;
+static IVShmemDesc ivshmem_desc;
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+uint32_t addr, uint32_t size, int type)
+{
+PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev;
+IVShmemState *s = d-ivshmem_state;
+
+IVSHMEM_DPRINTF(addr = %u size = %u\n, addr, size);
+cpu_register_physical_memory(addr, s-ivshmem_size, s-ivshmem_offset);
+
+}
+
+void ivshmem_init(const char * optarg) {
+
+char * temp;
+char * ivshmem_sz;
+int size;
+
+num_ivshmem_devices++;
+
+/* currently we only support 1 device */
+if (num_ivshmem_devices  MAX_IVSHMEM_DEVICES) {
+return;
+}
+
+temp = strdup(optarg);
+snprintf(ivshmem_desc.name, 1024, /%s, strsep(temp,,));
+ivshmem_sz=strsep(temp,,);
+if (ivshmem_sz != NULL){
+size = atol(ivshmem_sz);
+} else {
+size = -1;
+}
+
+ivshmem_desc.chrdev = strsep(temp,\0);
+
+if ( size == -1) {
+ivshmem_desc.size = TARGET_PAGE_SIZE;
+} else {
+ivshmem_desc.size = size*1024*1024;
+}
+IVSHMEM_DPRINTF(optarg is %s, name is %s, size is %d, chrdev is %s\n,
+optarg, ivshmem_desc.name,
+ivshmem_desc.size, 
ivshmem_desc.chrdev);
+}
+
+int ivshmem_get_size(void) {
+return ivshmem_desc.size;
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s)
+{
+int isr;
+isr = (s-intrstatus  s-intrmask)  0x;
+
+/* don't print ISR resets */
+if (isr) {
+IVSHMEM_DPRINTF(Set IRQ to %d (%04x %04x)\n,
+   isr

bridges

2009-05-07 Thread Ross Boylan

I'm trying to understand bridging with KVM, but am still puzzled.
I think that the recommended bridging with TAP means that packets from
the VM will end up going out the host card attached to the default
gateway.  But it looks to me as if their IP address is unchanged, which
means replies will never reach me.  Is that correct?  Do I need to NAT
the packets, or is something already doing that?

Some documents indicate that I need to bring the interfaces (e.g., eth0)
down before I bring the bridge up, and that afterwards only the bridge
will have an IP address.  Is that right?

Some documents, e.g.,
http://ebtables.sourceforge.net/br_fw_ia/br_fw_ia.html, indicate
iptables should just work with bridging.  However, I've seen someone
with a 2.6.15 kernel ask about firewalling and be told they needed to
patch the kernel to get it work (don't have the reference handy).
Should it just work?

I'm running a 2.6.29 kernel on Debian Lenny with kvm 72+dfsg-5~lenny1.
Version 84+dfsg-2 is available in experimental.  Is there much to be
gained by going with the more recent version?

Please cc me; I'm not on the list.

Thanks.
Ross Boylan


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] bios: Use a different mask to size the option ROM BAR

2009-05-07 Thread Alex Williamson

Bit 0 is the enable bit, which we not only don't want to set, but
it will stick and make us think it's an I/O port resource.

Signed-off-by: Alex Williamson alex.william...@hp.com
---

 bios/rombios32.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/bios/rombios32.c b/bios/rombios32.c
index 269f175..616e70a 100644
--- a/bios/rombios32.c
+++ b/bios/rombios32.c
@@ -948,11 +948,13 @@ static void pci_bios_init_device(PCIDevice *d)
 int ofs;
 uint32_t val, size ;
 
-if (i == PCI_ROM_SLOT)
+if (i == PCI_ROM_SLOT) {
 ofs = 0x30;
-else
+pci_config_writel(d, ofs, 0xfffe);
+} else {
 ofs = 0x10 + i * 4;
-pci_config_writel(d, ofs, 0x);
+pci_config_writel(d, ofs, 0x);
+}
 val = pci_config_readl(d, ofs);
 if (val != 0) {
 size = (~(val  ~0xf)) + 1;

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2] Driver for Inter-VM shared memory device for KVM supporting interrupts.

2009-05-07 Thread Cam Macdonell

Driver for inter-VM shared memory device that now supports interrupts between 
two guests.  The driver defines a counting semaphore and wait_event queue for 
different synchronization needs of users.  Initializing the semaphore count, 
sending interrupts and waiting are implemented via ioctl calls. 

The synchronization mechanisms are simple and rely on existing kernel 
primitives, but I think they're flexible for synchronization between guests.  
I'm contemplating more complicated designs that would use the shared memory to 
store synchronization variables, but thought I would get this initial patch out 
to get some feedback. 

Cheers,
Cam 

---
 drivers/char/Kconfig   |8 +
 drivers/char/Makefile  |2 +
 drivers/char/kvm_ivshmem.c |  430 
 3 files changed, 440 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/kvm_ivshmem.c

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..afa7cb8 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,14 @@ config DEVPORT
depends on ISA || PCI
default y
 
+config KVM_IVSHMEM
+tristate Inter-VM Shared Memory Device
+depends on PCI
+default m
+help
+  This device maps a region of shared memory between the host OS and any
+  number of virtual machines.
+
 source drivers/s390/char/Kconfig
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..021f06b 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -111,6 +111,8 @@ obj-$(CONFIG_PS3_FLASH) += ps3flash.o
 obj-$(CONFIG_JS_RTC)   += js-rtc.o
 js-rtc-y = rtc.o
 
+obj-$(CONFIG_KVM_IVSHMEM)  += kvm_ivshmem.o
+
 # Files generated that shall be removed upon make clean
 clean-files := consolemap_deftbl.c defkeymap.c
 
diff --git a/drivers/char/kvm_ivshmem.c b/drivers/char/kvm_ivshmem.c
new file mode 100644
index 000..a20a224
--- /dev/null
+++ b/drivers/char/kvm_ivshmem.c
@@ -0,0 +1,430 @@
+/*
+ * drivers/char/kvm_ivshmem.c - driver for KVM Inter-VM shared memory PCI 
device
+ *
+ * Copyright 2009 Cam Macdonell c...@cs.ualberta.ca
+ *
+ * Based on cirrusfb.c and 8139cp.c:
+ * Copyright 1999-2001 Jeff Garzik
+ * Copyright 2001-2004 Jeff Garzik
+ *
+ */
+
+#include linux/init.h
+#include linux/kernel.h
+#include linux/module.h
+#include linux/pci.h
+#include linux/proc_fs.h
+#include linux/smp_lock.h
+#include asm/uaccess.h
+#include linux/interrupt.h
+#include linux/mutex.h
+
+#define TRUE 1
+#define FALSE 0
+#define KVM_IVSHMEM_DEVICE_MINOR_NUM 0
+
+enum {
+/* KVM Inter-VM shared memory device register offsets */
+IntrMask= 0x00,/* Interrupt Mask */
+IntrStatus  = 0x10,/* Interrupt Status */
+Doorbell= 0x20,/* Doorbell */
+ShmOK = 1/* Everything is OK */
+};
+
+typedef struct kvm_ivshmem_device {
+void __iomem * regs;
+
+void * base_addr;
+
+unsigned int regaddr;
+unsigned int reg_size;
+
+unsigned int ioaddr;
+unsigned int ioaddr_size;
+unsigned int irq;
+
+bool enabled;
+
+} kvm_ivshmem_device;
+
+static int event_num;
+static struct semaphore sema;
+static wait_queue_head_t wait_queue;
+
+static kvm_ivshmem_device kvm_ivshmem_dev;
+
+static int device_major_nr;
+
+static int kvm_ivshmem_ioctl(struct inode *, struct file *, unsigned int, 
unsigned long);
+static int kvm_ivshmem_mmap(struct file *, struct vm_area_struct *);
+static int kvm_ivshmem_open(struct inode *, struct file *);
+static int kvm_ivshmem_release(struct inode *, struct file *);
+static ssize_t kvm_ivshmem_read(struct file *, char *, size_t, loff_t *);
+static ssize_t kvm_ivshmem_write(struct file *, const char *, size_t, loff_t 
*);
+static loff_t kvm_ivshmem_lseek(struct file * filp, loff_t offset, int origin);
+
+enum ivshmem_ioctl { set_sema, down_sema, sema_irq, wait_event, wait_event_irq 
};
+
+static const struct file_operations kvm_ivshmem_ops = {
+.owner   = THIS_MODULE,
+.open= kvm_ivshmem_open,
+.mmap= kvm_ivshmem_mmap,
+.read= kvm_ivshmem_read,
+.ioctl   = kvm_ivshmem_ioctl,
+.write   = kvm_ivshmem_write,
+.llseek  = kvm_ivshmem_lseek,
+.release = kvm_ivshmem_release,
+};
+
+static struct pci_device_id kvm_ivshmem_id_table[] = {
+{ 0x1af4, 0x1110, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
+{ 0 },
+};
+MODULE_DEVICE_TABLE (pci, kvm_ivshmem_id_table);
+
+static void kvm_ivshmem_remove_device(struct pci_dev* pdev);
+static int kvm_ivshmem_probe_device (struct pci_dev *pdev,
+const struct pci_device_id * ent);
+
+static struct pci_driver kvm_ivshmem_pci_driver = {
+.name= kvm-shmem,
+.id_table= kvm_ivshmem_id_table,
+.probe   = kvm_ivshmem_probe_device,
+.remove  = kvm_ivshmem_remove_device,
+};
+
+static int kvm_ivshmem_ioctl(struct inode * ino, struct file * filp,
+unsigned int cmd,

Re: [PATCH 1/4] BIOS changes for configuring irq0-inti2 override(v2)

2009-05-07 Thread Beth Kon


Avi Kivity wrote:

Beth Kon wrote:
These patches resolve the irq0-inti2 override issue, and get the 
hpet working

on kvm.
They are dependent on Jes Sorensen's recent 
0006-qemu-kvm-irq-routing.patch.


Override and HPET changes are sent as a series because HPET depends 
on the override. Win2k8 expects the HPET interrupt on inti2, 
regardless of whether an override exists in the BIOS. And the HPET 
spec states that in legacy mode, timer interrupt is on inti2.


The irq0-inti2 override will always be used unless the kernel cannot 
do irq routing (i.e., compatibility with old kernels). So if the 
kernel is capable, userspace sets up irq0-inti2 via the irq routing 
interface, and adds the irq0-inti2 override to the MADT interrupt 
source override table, and the mp table (for the no-acpi case).


A couple of months ago, Marcelo was seeing RHEL5 guests complain of 
invalid
checksum with these patches, but later he couldn't reproduce it, and 
I'm not seeing it now. While all guests still need to be fully 
tested, everything appears to be in order.  I've tested on win2k864, 
win2k832, RHEL5.3 32 bit, and ubuntu 8.10 64 bit.
  


What are the changes relative to v1?
Just merge issues with the changes you put in when moving to the newer 
bios. I submitted prematurely, incorrectly thinking I was done testing. 
When I finished, some problems surfaced.



@@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val)
 #define QEMU_CFG_SIGNATURE  0x00
 #define QEMU_CFG_ID 0x01
 #define QEMU_CFG_UUID   0x02
+#define QEMU_CFG_IRQ0_OVERRIDE 0x0e
  


As noted, this should be in the arch local space.

The base changes were not in the code yet. As we discussed on IRC, I'll 
resubmit once they're there.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Davide Libenzi

On Thu, 7 May 2009, Avi Kivity wrote:

 Davide Libenzi wrote:

   What's your take on adding irq context safe callbacks to irqfd?
   
   To give some background here, we would like to use eventfd as a generic
   connector between components, so the components do not know about each
   other.
   So far eventfd successfully abstracts among components in the same
   process, in
   different processes, and in the kernel.
   
   eventfd_signal() can be safely called from irq context, and will wake up a
   waiting task.  But in some cases, if the consumer is in the kernel, it may
   be
   able to consume the event from irq context, saving a context switch.
   
   So, will you consider patches adding this capability to eventfd?
   
  
  Maybe I got lost in the thread, but inside the kernel we have callback-based
  wakeup since long time. This is what epoll uses, when hooking into the file*
  f_op-poll() subsystem.
  Did you mean something else?

 
 Do you mean wait_queue_t::func?

Yes, it is since long time ago that a wakeup in Linux can lead either to a 
real wakeup (in scheduler terms), or to a simple function call.
Isn't that enough?


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins

Chris Wright wrote:
 * Gregory Haskins (ghask...@novell.com) wrote:
   
 Chris Wright wrote:
 
 VF drivers can also have this issue (and typically use mmio).
 I at least have a better idea what your proposal is, thanks for
 explanation.  Are you able to demonstrate concrete benefit with it yet
 (improved latency numbers for example)?
   
 I had a test-harness/numbers for this kind of thing, but its a bit
 crufty since its from ~1.5 years ago.  I will dig it up, update it, and
 generate/post new numbers.
 

 That would be useful, because I keep coming back to pio and shared
 page(s) when think of why not to do this.  Seems I'm not alone in that.

 thanks,
 -chris
   

I completed the resurrection of the test and wrote up a little wiki on
the subject, which you can find here:

http://developer.novell.com/wiki/index.php/WhyHypercalls

Hopefully this answers Chris' show me the numbers and Anthony's Why
reinvent the wheel? questions.

I will include this information when I publish the updated v2 series
with the s/hypercall/dynhc changes.

Let me know if you have any questions.

-Greg



signature.asc
Description: OpenPGP digital signature

[PATCH] kvm: device-assignment: Catch GSI overflow

2009-05-07 Thread Alex Williamson

Fix the index at which we return -ENOSPC since the kernel side will
reject a GSI = KVM_MAX_IRQ_ROUTES.  Also, mask as a signed int before
testing for error.

Signed-off-by: Alex Williamson alex.william...@hp.com
---

 hw/device-assignment.c |2 +-
 kvm/libkvm/libkvm.c|2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index a7365c8..e06dd08 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -796,7 +796,7 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, 
unsigned int ctrl_pos)
 pci_dev-cap.start + PCI_MSI_DATA_32);
 assigned_dev-entry-type = KVM_IRQ_ROUTING_MSI;
 assigned_dev-entry-gsi = kvm_get_irq_route_gsi(kvm_context);
-if (assigned_dev-entry-gsi  0) {
+if ((int)(assigned_dev-entry-gsi)  0) {
 perror(assigned_dev_update_msi: kvm_get_irq_route_gsi);
 return;
 }
diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c
index ba0a5d1..2a4165a 100644
--- a/kvm/libkvm/libkvm.c
+++ b/kvm/libkvm/libkvm.c
@@ -1408,7 +1408,7 @@ int kvm_get_irq_route_gsi(kvm_context_t kvm)
 {
 #ifdef KVM_CAP_IRQ_ROUTING
if (kvm-max_used_gsi = KVM_IOAPIC_NUM_PINS)  {
-   if (kvm-max_used_gsi = kvm_get_gsi_count(kvm))
+   if (kvm-max_used_gsi + 1  kvm_get_gsi_count(kvm))
 return kvm-max_used_gsi + 1;
 else
 return -ENOSPC;

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm: device-assignment: Catch GSI overflow

2009-05-07 Thread Alex Williamson

On Thu, 2009-05-07 at 11:09 -0600, Alex Williamson wrote:
 Fix the index at which we return -ENOSPC since the kernel side will
 reject a GSI = KVM_MAX_IRQ_ROUTES.  Also, mask as a signed int before
 testing for error.

Even with this, there still seems to be a fundamental problem with our
consumption of GSIs in kvm.  For example, every time a guest writes to
the MSI capabilities area and enables MSI support, we call
kvm_get_irq_route_gsi() to get a new max_used_gsi + 1 value, then call
kvm_add_routing_entry(), which updates max_used_gsi.  It doesn't take
too long before we exhaust the GSI space and the device no longer works.
This seems to happen within a minute or two of booting a guest with an
e1000e device sitting idle on a busy network.  Do we need to keep a
bitmap of used GSIs or maybe just attempt to reuse the GSI we've gotten
previously for the device?  Thanks,

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: bridges

2009-05-07 Thread Ross Boylan

On Thu, 2009-05-07 at 11:13 -0600, Cam Macdonell wrote:
 
 Ross Boylan wrote:
  I'm trying to understand bridging with KVM, but am still puzzled.
  I think that the recommended bridging with TAP means that packets
 from
  the VM will end up going out the host card attached to the default
  gateway.  But it looks to me as if their IP address is unchanged,
 which
  means replies will never reach me.  Is that correct?  Do I need to
 NAT
  the packets, or is something already doing that?
 
 Hi Ross,
 
 This is the place to start
 
 http://www.linux-kvm.org/page/Networking.  
I saw that; it gives some recipes but I wasn't sure what their effect
was.

 You want a public bridge.
 
 I'm not sure what their and me mean in your email.  In short,
 with 
 bridging each VM has its own IP and that VM can be accessed directly 
 from the network.
their = the VM.
me = my host machine.

So if the VM's are running on their own subnet, e.g., 10.0.2.* (I've
been assuming the subnet with TAP is like the one with the User Mode
Network stack in 3.7.3 of http://www.nongnu.org/qemu/qemu-doc.html) and
my host machine is on another net, e.g., 10.0.8.* then I think the
packet will go out with an IP of 10.0.2.2 (say).  When some other
machine tries to reply to 10.0.2.2, the packet gets lost because the
outside network thinks 10.0.2.* is not for it.  At least that's my
concern.  If the return IP address on the packet were 10.0.8.44
(supposing that's the IP of my host machine) then the packets could find
their way back.

My host machine is on an internal network with a 10.* IP.  The example
might be clearer if one supposed that the VM's were on a 192.168.*
network.

I am perhaps being influenced by the fact that I don't want to ask for
more IP's, so I don't want to configure the VM's to use an IP on our
10.0.8 network.

Does the TAP networking setup a whole subnet like the user mode network
stack (e.g., running a DHCP server), or is the idea that I would just
give the VM an IP on my subnet (10.0.8.*) in this example?

If the latter is the case (I'm now suspecting it is) then I think the
solution is clear.  I just stick the VM's on a private (to my machine)
subnet, like 192.168.*, and I do NAT on the packets as they go out.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity


Gregory Haskins wrote:

I completed the resurrection of the test and wrote up a little wiki on
the subject, which you can find here:

http://developer.novell.com/wiki/index.php/WhyHypercalls

Hopefully this answers Chris' show me the numbers and Anthony's Why
reinvent the wheel? questions.

I will include this information when I publish the updated v2 series
with the s/hypercall/dynhc changes.

Let me know if you have any questions.
  


Well, 420 ns is not to be sneezed at.

What do you think of my mmio hypercall?  That will speed up all mmio to 
be as fast as a hypercall, and then we can use ordinary mmio/pio writes 
to trigger things.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins

Avi Kivity wrote:
 Gregory Haskins wrote:
 I completed the resurrection of the test and wrote up a little wiki on
 the subject, which you can find here:

 http://developer.novell.com/wiki/index.php/WhyHypercalls

 Hopefully this answers Chris' show me the numbers and Anthony's Why
 reinvent the wheel? questions.

 I will include this information when I publish the updated v2 series
 with the s/hypercall/dynhc changes.

 Let me know if you have any questions.
   

 Well, 420 ns is not to be sneezed at.

 What do you think of my mmio hypercall?  That will speed up all mmio
 to be as fast as a hypercall, and then we can use ordinary mmio/pio
 writes to trigger things.

I like it!

Bigger question is what kind of work goes into making mmio a pv_op (or
is this already done)?

-Greg



signature.asc
Description: OpenPGP digital signature

Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Avi Kivity


Davide Libenzi wrote:

On Thu, 7 May 2009, Avi Kivity wrote:

  

Davide Libenzi wrote:

  
  

What's your take on adding irq context safe callbacks to irqfd?

To give some background here, we would like to use eventfd as a generic
connector between components, so the components do not know about each
other.
So far eventfd successfully abstracts among components in the same
process, in
different processes, and in the kernel.

eventfd_signal() can be safely called from irq context, and will wake up a
waiting task.  But in some cases, if the consumer is in the kernel, it may
be
able to consume the event from irq context, saving a context switch.

So, will you consider patches adding this capability to eventfd?



Maybe I got lost in the thread, but inside the kernel we have callback-based
wakeup since long time. This is what epoll uses, when hooking into the file*
f_op-poll() subsystem.
Did you mean something else?
  
  

Do you mean wait_queue_t::func?



Yes, it is since long time ago that a wakeup in Linux can lead either to a 
real wakeup (in scheduler terms), or to a simple function call.

Isn't that enough?

  


It's perfect.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity


Gregory Haskins wrote:

What do you think of my mmio hypercall?  That will speed up all mmio
to be as fast as a hypercall, and then we can use ordinary mmio/pio
writes to trigger things.



I like it!

Bigger question is what kind of work goes into making mmio a pv_op (or
is this already done)?

  


Looks like it isn't there.  But it isn't any different than set_pte - 
convert a write into a hypercall.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins

Avi Kivity wrote:
 Gregory Haskins wrote:
 What do you think of my mmio hypercall?  That will speed up all mmio
 to be as fast as a hypercall, and then we can use ordinary mmio/pio
 writes to trigger things.

 
 I like it!

 Bigger question is what kind of work goes into making mmio a pv_op (or
 is this already done)?

   

 Looks like it isn't there.  But it isn't any different than set_pte -
 convert a write into a hypercall.



I guess technically mmio can just be a simple access of the page which
would be problematic to trap locally without a PF.  However it seems
that most mmio always passes through a ioread()/iowrite() call so this
is perhaps the hook point.  If we set the stake in the ground that mmios
that go through some other mechanism like PFs can just hit the slow
path are an acceptable casualty, I think we can make that work.

Thoughts?

-Greg



signature.asc
Description: OpenPGP digital signature

Re: [PATCH 2/2] Intel-IOMMU, intr-remap: source-id checking

2009-05-07 Thread Suresh Siddha

On Wed, 2009-05-06 at 23:16 -0700, Han, Weidong wrote:
 @@ -634,6 +694,44 @@ static int ir_parse_ioapic_scope(struct acpi_dmar_header 
 *header,
   0x%Lx\n, scope-enumeration_id,
  drhd-address);
  
 + bus = pci_find_bus(drhd-segment, scope-bus);
 + path = (struct acpi_dmar_pci_path *)(scope + 1);
 + count = (scope-length -
 +  sizeof(struct acpi_dmar_device_scope))
 + / sizeof(struct acpi_dmar_pci_path);
 +
 + while (count) {
 + if (pdev)
 + pci_dev_put(pdev);
 +
 + if (!bus)
 + break;
 +
 + pdev = pci_get_slot(bus,
 + PCI_DEVFN(path-dev, path-fn));
 + if (!pdev)
 + break;

ir_parse_ioapic_scope() happens very early in the boot. So, I don't
think we can do the pci related discovery here.

thanks,
suresh

 +
 + path++;
 + count--;
 + bus = pdev-subordinate;
 + }
 +
 + if (pdev) { /* PCI discoverable IOAPIC*/
 + ir_ioapic[ir_ioapic_num].bus =
 + pdev-bus-number;
 + ir_ioapic[ir_ioapic_num].devfn = pdev-devfn;
 + } else { /* Not PCI discoverable IOAPIC */
 + if (!bus)
 + ir_ioapic[ir_ioapic_num].bus =
 + scope-bus;
 + else
 + ir_ioapic[ir_ioapic_num].bus =
 + bus-number;
 + ir_ioapic[ir_ioapic_num].devfn =
 + PCI_DEVFN(path-dev, path-fn);
 + }
 +
   ir_ioapic[ir_ioapic_num].iommu = iommu;
   ir_ioapic[ir_ioapic_num].id = scope-enumeration_id;
   ir_ioapic_num++;


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity


Gregory Haskins wrote:

I guess technically mmio can just be a simple access of the page which
would be problematic to trap locally without a PF.  However it seems
that most mmio always passes through a ioread()/iowrite() call so this
is perhaps the hook point.  If we set the stake in the ground that mmios
that go through some other mechanism like PFs can just hit the slow
path are an acceptable casualty, I think we can make that work.
  


That's my thinking exactly.

Note we can cheat further.  kvm already has a coalesced mmio feature 
where side-effect-free mmios are collected in the kernel and passed to 
userspace only when some other significant event happens.  We could pass 
those addresses to the guest and let it queue those writes itself, 
avoiding the hypercall completely.


Though it's probably pointless: if the guest is paravirtualized enough 
to have the mmio hypercall, then it shouldn't be using e1000.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins

Avi Kivity wrote:
 Gregory Haskins wrote:
 I guess technically mmio can just be a simple access of the page which
 would be problematic to trap locally without a PF.  However it seems
 that most mmio always passes through a ioread()/iowrite() call so this
 is perhaps the hook point.  If we set the stake in the ground that mmios
 that go through some other mechanism like PFs can just hit the slow
 path are an acceptable casualty, I think we can make that work.
   

 That's my thinking exactly.

Cool,  I will code this up and submit it.  While Im at it, Ill run it
through the nullio ringer, too. ;)  It would be cool to see the
pv-mmio hit that 2.07us number.  I can't think of any reason why this
will not be the case.


 Note we can cheat further.  kvm already has a coalesced mmio feature
 where side-effect-free mmios are collected in the kernel and passed to
 userspace only when some other significant event happens.  We could
 pass those addresses to the guest and let it queue those writes
 itself, avoiding the hypercall completely.

 Though it's probably pointless: if the guest is paravirtualized enough
 to have the mmio hypercall, then it shouldn't be using e1000.

Yeah...plus at least for my vbus purposes, all my my guest-host
transitions are explicitly to cause side-effects, or I wouldn't be doing
them in the first place ;)  I suspect virtio-pci is exactly the same. 
I.e. the coalescing has already been done at a higher layer for
platforms running PV code.

Still a cool feature, tho.

-Greg




signature.asc
Description: OpenPGP digital signature

qemu: kvm: avoid harmless unhandled wrmsr 0xc0010117 messages

2009-05-07 Thread Marcelo Tosatti


Olders kernel which don't contain kvm.git commit
61a6bd672bda3b9468bf5895c1be085c4e481138 display the following message:

kvm: 32301: cpu0 unhandled wrmsr: 0xc0010117 data 0 

When kvm_arch_load_regs is called. This is confusing in bug reports.

Avoid it by checking whether the host advertises the MSR, similarly to
how MSR_STAR is handled.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com


diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 06ef775..5d19705 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -25,6 +25,7 @@
 static struct kvm_msr_list *kvm_msr_list;
 extern unsigned int kvm_shadow_memory;
 static int kvm_has_msr_star;
+static int kvm_has_vm_hsave_pa;
 
 static int lm_capable_kernel;
 
@@ -54,10 +55,14 @@ int kvm_arch_qemu_create_context(void)
 kvm_msr_list = kvm_get_msr_list(kvm_context);
 if (!kvm_msr_list)
return -1;
-for (i = 0; i  kvm_msr_list-nmsrs; ++i)
+for (i = 0; i  kvm_msr_list-nmsrs; ++i) {
if (kvm_msr_list-indices[i] == MSR_STAR)
kvm_has_msr_star = 1;
-   return 0;
+if (kvm_msr_list-indices[i] == MSR_VM_HSAVE_PA)
+kvm_has_vm_hsave_pa = 1;
+}
+
+return 0;
 }
 
 static void set_msr_entry(struct kvm_msr_entry *entry, uint32_t index,
@@ -260,7 +265,8 @@ void kvm_arch_load_regs(CPUState *env)
 set_msr_entry(msrs[n++], MSR_IA32_SYSENTER_EIP, env-sysenter_eip);
 if (kvm_has_msr_star)
set_msr_entry(msrs[n++], MSR_STAR,  env-star);
-set_msr_entry(msrs[n++], MSR_VM_HSAVE_PA, env-vm_hsave);
+if (kvm_has_vm_hsave_pa)
+set_msr_entry(msrs[n++], MSR_VM_HSAVE_PA, env-vm_hsave);
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 set_msr_entry(msrs[n++], MSR_CSTAR, env-cstar);
@@ -435,7 +441,8 @@ void kvm_arch_save_regs(CPUState *env)
 if (kvm_has_msr_star)
msrs[n++].index = MSR_STAR;
 msrs[n++].index = MSR_IA32_TSC;
-msrs[n++].index = MSR_VM_HSAVE_PA;
+if (kvm_has_vm_hsave_pa)
+msrs[n++].index = MSR_VM_HSAVE_PA;
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 msrs[n++].index = MSR_CSTAR;

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 3/4] KVM: introduce kvm_arch_can_free_memslot, disallow slot deletion if cached cr3

2009-05-07 Thread Marcelo Tosatti

On Thu, May 07, 2009 at 05:16:35PM +0300, Avi Kivity wrote:
 mtosa...@redhat.com wrote:
 Disallow the deletion of memory slots (and aliases, for x86 case), if a
 vcpu contains a cr3 that points to such slot/alias.
   

 That allows the guest to induce failures in the host.  

I don't understand what you mean. What is the problem with returning
errors in the ioctl handlers?

The guest can cause an overflow in qemu, overwrite the parameters to
KVM_GET_MSR_INDEX_LIST in an attempt to read kernel data, and get
-E2BIG. Or pick your combination.

 Better to triple-fault the guest instead.

Sure can additionally triple fault it, but the kernel might attempt to
access the non-existant slot which cr3 points to before TRIPLE_FAULT is
processed. So you have to avoid that possibility in the first place,
thats why the patch modifies the ioctls to fail.

  +int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot 
 *slot)
 +{
 +return 1;
 +}
 +
   

 In general, instead of stubs in every arch, have x86 say  
 KVM_HAVE_ARCH_CAN_FREE_MEMSLOT and define the stub in generic code when  
 that define is not present.

Will fix that.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity


Gregory Haskins wrote:

Avi Kivity wrote:
  

Gregory Haskins wrote:


I guess technically mmio can just be a simple access of the page which
would be problematic to trap locally without a PF.  However it seems
that most mmio always passes through a ioread()/iowrite() call so this
is perhaps the hook point.  If we set the stake in the ground that mmios
that go through some other mechanism like PFs can just hit the slow
path are an acceptable casualty, I think we can make that work.
  
  

That's my thinking exactly.



Cool,  I will code this up and submit it.  While Im at it, Ill run it
through the nullio ringer, too. ;)  It would be cool to see the
pv-mmio hit that 2.07us number.  I can't think of any reason why this
will not be the case.
  


Don't - it's broken.  It will also catch device assignment mmio and 
hypercall them.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins

Avi Kivity wrote:
 Gregory Haskins wrote:
 Avi Kivity wrote:
  
 Gregory Haskins wrote:

 I guess technically mmio can just be a simple access of the page which
 would be problematic to trap locally without a PF.  However it seems
 that most mmio always passes through a ioread()/iowrite() call so this
 is perhaps the hook point.  If we set the stake in the ground that
 mmios
 that go through some other mechanism like PFs can just hit the slow
 path are an acceptable casualty, I think we can make that work.
 
 That's my thinking exactly.
 

 Cool,  I will code this up and submit it.  While Im at it, Ill run it
 through the nullio ringer, too. ;)  It would be cool to see the
 pv-mmio hit that 2.07us number.  I can't think of any reason why this
 will not be the case.
   

 Don't - it's broken.  It will also catch device assignment mmio and
 hypercall them.

Ah.  Crap.

Would you be conducive if I continue along with the dynhc() approach then?

-Greg



signature.asc
Description: OpenPGP digital signature

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright

* Avi Kivity (a...@redhat.com) wrote:
 Gregory Haskins wrote:
 Cool,  I will code this up and submit it.  While Im at it, Ill run it
 through the nullio ringer, too. ;)  It would be cool to see the
 pv-mmio hit that 2.07us number.  I can't think of any reason why this
 will not be the case.

 Don't - it's broken.  It will also catch device assignment mmio and  
 hypercall them.

Not necessarily.  It just needs to be creative w/ IO_COND
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins

Chris Wright wrote:
 * Avi Kivity (a...@redhat.com) wrote:
   
 Gregory Haskins wrote:
 
 Cool,  I will code this up and submit it.  While Im at it, Ill run it
 through the nullio ringer, too. ;)  It would be cool to see the
 pv-mmio hit that 2.07us number.  I can't think of any reason why this
 will not be the case.
   
 Don't - it's broken.  It will also catch device assignment mmio and  
 hypercall them.
 

 Not necessarily.  It just needs to be creative w/ IO_COND
   

Hi Chris,
   Could you elaborate?  How would you know which pages to hypercall and
which to let PF?

-Greg



signature.asc
Description: OpenPGP digital signature

Re: bridges

2009-05-07 Thread Cam Macdonell


Ross Boylan wrote:

On Thu, 2009-05-07 at 11:13 -0600, Cam Macdonell wrote:

Ross Boylan wrote:

I'm trying to understand bridging with KVM, but am still puzzled.
I think that the recommended bridging with TAP means that packets

from

the VM will end up going out the host card attached to the default
gateway.  But it looks to me as if their IP address is unchanged,

which

means replies will never reach me.  Is that correct?  Do I need to

NAT

the packets, or is something already doing that?

Hi Ross,

This is the place to start

http://www.linux-kvm.org/page/Networking.  

I saw that; it gives some recipes but I wasn't sure what their effect
was.


You want a public bridge.

I'm not sure what their and me mean in your email.  In short,
with 
bridging each VM has its own IP and that VM can be accessed directly 
from the network.

their = the VM.
me = my host machine.

So if the VM's are running on their own subnet, 


VMs do not run on their own subnet with bridged networking.


e.g., 10.0.2.* (I've
been assuming the subnet with TAP is like the one with the User Mode
Network stack in 3.7.3 of http://www.nongnu.org/qemu/qemu-doc.html) and
my host machine is on another net, e.g., 10.0.8.* then I think the
packet will go out with an IP of 10.0.2.2 (say).  When some other
machine tries to reply to 10.0.2.2, the packet gets lost because the
outside network thinks 10.0.2.* is not for it.  At least that's my
concern.  If the return IP address on the packet were 10.0.8.44
(supposing that's the IP of my host machine) then the packets could find
their way back.


Using bridged networking is very different from the user stack.  The 
user stack is extremely limited and slow.




My host machine is on an internal network with a 10.* IP.  The example
might be clearer if one supposed that the VM's were on a 192.168.*
network.

I am perhaps being influenced by the fact that I don't want to ask for
more IP's, so I don't want to configure the VM's to use an IP on our
10.0.8 network.


Then you probably want to use a NAT network.  A NAT setup puts all the 
VMs on their own network within the host machine.  iptables is necessary 
to forward the subnet packets out to the world and back.


Here is some older documentation, but not much has changed.  Look at the 
first entry under Advanced Networking.


https://help.ubuntu.com/community/KVMFeisty


Does the TAP networking setup a whole subnet like the user mode network
stack (e.g., running a DHCP server), or is the idea that I would just
give the VM an IP on my subnet (10.0.8.*) in this example?


No, bridge networking using taps (one tap per VM) and effectively sits 
all the VMs on the same network your host is on.  You would need to get 
IPs from sysadmin for each VM.



If the latter is the case (I'm now suspecting it is) then I think the
solution is clear.  I just stick the VM's on a private (to my machine)
subnet, like 192.168.*, and I do NAT on the packets as they go out.


NAT is a very common solution.  Use VDE (vde.sourceforget.net) to create 
a virtual switch on your host for the VMs.  dnsmasq can serve dynamic 
IPs to the VMs on their own subnet that doesn't bother your sysadmin at 
all.  Use iptables to forward and receive packets through your host's 
NIC.


Cam
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright

* Gregory Haskins (ghask...@novell.com) wrote:
 Chris Wright wrote:
  * Avi Kivity (a...@redhat.com) wrote:
  Gregory Haskins wrote:
  Cool,  I will code this up and submit it.  While Im at it, Ill run it
  through the nullio ringer, too. ;)  It would be cool to see the
  pv-mmio hit that 2.07us number.  I can't think of any reason why this
  will not be the case.

  Don't - it's broken.  It will also catch device assignment mmio and  
  hypercall them.
 
  Not necessarily.  It just needs to be creative w/ IO_COND
 
 Hi Chris,
Could you elaborate?  How would you know which pages to hypercall and
 which to let PF?

Was just thinking of some ugly mangling of the addr (I'm not entirely
sure what would work best).
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity


Chris Wright wrote:

* Gregory Haskins (ghask...@novell.com) wrote:
  

Chris Wright wrote:


* Avi Kivity (a...@redhat.com) wrote:
  

Gregory Haskins wrote:


Cool,  I will code this up and submit it.  While Im at it, Ill run it
through the nullio ringer, too. ;)  It would be cool to see the
pv-mmio hit that 2.07us number.  I can't think of any reason why this
will not be the case.
  
  
Don't - it's broken.  It will also catch device assignment mmio and  
hypercall them.


Not necessarily.  It just needs to be creative w/ IO_COND
  

Hi Chris,
   Could you elaborate?  How would you know which pages to hypercall and
which to let PF?



Was just thinking of some ugly mangling of the addr (I'm not entirely
sure what would work best).
  


I think we just past the too complicated threshold.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins

Chris Wright wrote:
 * Gregory Haskins (ghask...@novell.com) wrote:
   
 Chris Wright wrote:
 
 * Avi Kivity (a...@redhat.com) wrote:
   
 Gregory Haskins wrote:
 
 Cool,  I will code this up and submit it.  While Im at it, Ill run it
 through the nullio ringer, too. ;)  It would be cool to see the
 pv-mmio hit that 2.07us number.  I can't think of any reason why this
 will not be the case.
   
   
 Don't - it's broken.  It will also catch device assignment mmio and  
 hypercall them.
 
 Not necessarily.  It just needs to be creative w/ IO_COND
   
 Hi Chris,
Could you elaborate?  How would you know which pages to hypercall and
 which to let PF?
 

 Was just thinking of some ugly mangling of the addr (I'm not entirely
 sure what would work best).
   
Right, I get the part about flagging the address and then keying off
that flag in IO_COND (like we do for PIO vs MMIO).

What I am not clear on is how you would know to flag the address to
begin with.

Here's a thought: PV drivers can flag the IO (e.g. virtio-pci knows it
would never be a real device).  This means we route any io requests from
virtio-pci though pv_io_ops-mmio(), but not unflagged addresses.  This
is not as slick as boosting *everyones* mmio speed as Avi's original
idea would have, but it is perhaps a good tradeoff between the entirely
new namespace created by my original dynhc() proposal and leaving them
all PF based.

This way, its just like using my dynhc() proposal except the mmio-addr
is the substitute address-token (instead of the dynhc-vector). 
Additionally, if you do not PV the kernel the IO_COND/pv_io_op is
ignored and it just slow-paths through the PF as it does today.  Dynhc()
would be dependent  on pv_ops.

Thoughts?

-Greg



signature.asc
Description: OpenPGP digital signature

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity


Gregory Haskins wrote:

Don't - it's broken.  It will also catch device assignment mmio and
hypercall them.



Ah.  Crap.

Would you be conducive if I continue along with the dynhc() approach then?
  


Oh yes.  But don't call it dynhc - like Chris says it's the wrong semantic.

Since we want to connect it to an eventfd, call it HC_NOTIFY or HC_EVENT 
or something along these lines.  You won't be able to pass any data, but 
that's fine.  Registers are saved to memory anyway.


And btw, given that eventfd and the underlying infrastructure are so 
flexible, it's probably better to go back to your original irqfd gets 
fd from userspace just to be consistent everywhere.


(no, I'm not deliberately making you rewrite that patch again and 
again... it's going to be a key piece of infrastructure so I want to get 
it right)


Just to make sure we have everything plumbed down, here's how I see 
things working out (using qemu and virtio, use sed to taste):


1. qemu starts up, sets up the VM
2. qemu creates virtio-net-server
3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx 
ring, one set for rx ring)
4. qemu connects the six eventfd to the data-available, 
data-not-available, and kick ports of virtio-net-server

5. the guest starts up and configures virtio-net in pci pin mode
6. qemu notices and decides it will manage interrupts in user space 
since this is complicated (shared level triggered interrupts)

7. the guest OS boots, loads device driver
8. device driver switches virtio-net to msix mode
9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the 
notify fds as notifyfd

10. look ma, no hands.

Under the hood, the following takes place.

kvm wires the irqfds to schedule a work item which fires the interrupt.  
One day the kvm developers get their act together and change it to 
inject the interrupt directly when the irqfd is signalled (which could 
be from the net softirq or somewhere similarly nasty).


virtio-net-server wires notifyfd according to its liking.  It may 
schedule a thread, or it may execute directly.


And they all lived happily ever after.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity


Avi Kivity wrote:


I think we just past the too complicated threshold.



And the can't spel threshold in the same sentence.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Arnd Bergmann

On Thursday 07 May 2009, Gregory Haskins wrote:
 I guess technically mmio can just be a simple access of the page which
 would be problematic to trap locally without a PF.  However it seems
 that most mmio always passes through a ioread()/iowrite() call so this
 is perhaps the hook point.  If we set the stake in the ground that mmios
 that go through some other mechanism like PFs can just hit the slow
 path are an acceptable casualty, I think we can make that work.
 
 Thoughts?

An mmio that goes through a PF is a bug, it's certainly broken on
a number of platforms, so performance should not be an issue there.

Note that are four commonly used interface classes for PIO/MMIO:

1. readl/writel: little-endian MMIO
2. inl/outl: little-endian PIO
3. ioread32/iowrite32: converged little-endian PIO/MMIO
4. __raw_readl/__raw_writel: native-endian MMIO without checks

You don't need to worry about the __raw_* stuff, as this should never
be used in device drivers.

As a simplification, you could mandate that all drivers that want to
use this get converted to the ioread/iowrite class of interfaces and
leave the others slow.

Arnd 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 >

1 - 100 of 135 matches

Mail list logo