Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change

2009-02-13 Thread Marc Bevand
On Fri, Feb 13, 2009 at 8:23 AM, Jamie Lokier  wrote:
>
> Marc..  this is quite a serious bug you've reported.  Is there a
> reason you didn't report it earlier?

Because I only started hitting that bug a couple weeks ago after
having upgraded to a buggy kvm version.

> Is there a way to restructure the code and/or how it works so it's
> more clearly correct?

I am seriously concerned about the general design of qcow2. The code
base is more complex than it needs to be, the format itself is
susceptible to race conditions causing cluster leaks when updating
some internal datastructures, it gets easily fragmented, etc.

I am considering implementing a new disk image format that supports
base images, snapshots (of the guest state), clones (of the disk
content); that has a radically simpler design & code base; that is
always consistent "on disk"; that is friendly to delta diffing (ie.
space-efficient when used with ZFS snapshots or rsync); and that makes
use of checksumming & replication to detect & fix corruption of
critical data structures (ideally this should be implemented by the
filesystem, unfortunately ZFS is not available everywhere :D).

I believe the key to achieve these (seemingly utopian) goals is to
represent a disk "image" as a set of sparse files, 1 per
snapshot/clone.

-marc
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pci: add remove_id sysfs entry

2009-02-13 Thread Greg KH
On Fri, Feb 13, 2009 at 06:12:44PM -0800, Chris Wright wrote:
> This adds a remove_id sysfs entry to allow users of new_id to later
> remove the added dynid.  One use case is management tools that want to
> dynamically bind/unbind devices to pci-stub driver while devices are
> assigned to KVM guests.  Rather than having to track which driver was
> originally bound to the driver, a mangement tool can simply:
> 
> # echo "8086 10f5" > /sys/bus/pci/drivers/pci-stub/new_id

As you're adding a new sysfs file, can you please document it in
Documentation/ABI?  I know the original bind and unbind files aren't
documented there, they were created before the ABI/ directory showed up.

It would be nice to add them as well if anyone wants to :)

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 1/7] PCI: initialize and release SR-IOV capability

2009-02-13 Thread Yu Zhao
On Sat, Feb 14, 2009 at 01:49:59AM +0800, Matthew Wilcox wrote:
> On Fri, Feb 13, 2009 at 05:56:44PM +0100, Andi Kleen wrote:
> > > + pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
> > > + if (ctrl & PCI_SRIOV_CTRL_VFE) {
> > > + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
> > > + msleep(100);
> > 
> > That's really long. Hopefully that's really needed.
> 
> Yes and no.  The spec says:
> 
>   To allow components to perform internal initialization, system software
>   must wait for at least 100 ms after changing the VF Enable bit from
>   a 0 to a 1, before it is permitted to issue Configuration Requests to
>   the VFs which are enabled by that VF Enable bit.
> 
> So we don't have to wait here, but we do have to wait before exposing
> all these virtual functions to the rest of the system.  Should we add
> more complexity, perhaps spawn a thread to do it asynchronously, or add
> 0.1 seconds to device initialisation?  A question without an easy
> answer, iMO.

This clears the VF Enable bit only if the BIOS has set it, so it doesn't
always happen. Actually the `msleep(100)' should be `ssleep(1)' here,
according to the spec you showed us below. I remembered the waiting time
incorrectly as 100ms which is the requirment for setting the VF Enable
bit rather than clearing it.

> > > +
> > > + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
> > > + pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
> > > + pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
> > > + pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
> > > + if (!offset || (total > 1 && !stride))
> > > + return -EIO;
> > > +
> > > + pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz);
> > > + i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
> > > + pgsz &= ~((1 << i) - 1);
> > > + if (!pgsz)
> > > + return -EIO;
> > 
> > All the error paths don't seem to undo the config space writes.
> > How will the devices behave with half initialized context?
> 
> I think we should clear the VF_ENABLE bit.  That action is also fraught
> with danger:

The VF Eanble bit hasn't been set yet :-) Actually the spec forbids the
s/w to write those registers (NumVFs, Supported Page Size, etc.) when the
enabling bit is set.

> 
>   If software Clears VF Enable, software must allow 1 second after VF
>   Enable is Cleared before reading any field in the SR-IOV Extended
>   Capability or the VF Migration State Array (see Section 3.3.15.1).
> 
> Another msleep(1000) here?  Not pretty, but what else can we do?
> 
> Not to mention the danger of something else innocently using lspci -
> to read a field in the extended capability -- I suspect we also need to
> block user config accesses before clearing this bit.

Yes, we should block user config access.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 1/7] PCI: initialize and release SR-IOV capability

2009-02-13 Thread Yu Zhao
On Sat, Feb 14, 2009 at 12:56:44AM +0800, Andi Kleen wrote:
> Yu Zhao  writes:
> > +
> > +
> > +static int sriov_init(struct pci_dev *dev, int pos)
> > +{
> > +   int i;
> > +   int rc;
> > +   int nres;
> > +   u32 pgsz;
> > +   u16 ctrl, total, offset, stride;
> > +   struct pci_sriov *iov;
> > +   struct resource *res;
> > +   struct pci_dev *pdev;
> > +
> > +   if (dev->pcie_type != PCI_EXP_TYPE_RC_END &&
> > +   dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)
> > +   return -ENODEV;
> > +
> 
> It would be a good idea to put a might_sleep() here just in 
> case the msleep happens below and drivers call it incorrectly.

Yes, will do.

> > +   pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
> > +   if (ctrl & PCI_SRIOV_CTRL_VFE) {
> > +   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
> > +   msleep(100);
> 
> That's really long. Hopefully that's really needed.

It's needed according to SR-IOV spec, however, these lines clear
the VF Enable bit if the BIOS or something else has set it. So it
doesn't always run into this.

> > +
> > +   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
> > +   pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
> > +   pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
> > +   pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
> > +   if (!offset || (total > 1 && !stride))
> > +   return -EIO;
> > +
> > +   pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz);
> > +   i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
> > +   pgsz &= ~((1 << i) - 1);
> > +   if (!pgsz)
> > +   return -EIO;
> 
> All the error paths don't seem to undo the config space writes.
> How will the devices behave with half initialized context?

Since the VF Enable bit is cleared before the initialization, setting
others SR-IOV registers won't change state of the device. So it should
be OK even without undo these writes as long as the VF Enable bit is
not set.

Thanks,
Yu
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] pci: add remove_id sysfs entry

2009-02-13 Thread Chris Wright
This adds a remove_id sysfs entry to allow users of new_id to later
remove the added dynid.  One use case is management tools that want to
dynamically bind/unbind devices to pci-stub driver while devices are
assigned to KVM guests.  Rather than having to track which driver was
originally bound to the driver, a mangement tool can simply:

# echo "8086 10f5" > /sys/bus/pci/drivers/pci-stub/new_id
# echo -n :00:19.0 > /sys/bus/pci/devices/:00:19.0/driver/unbind
# echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind

Guest uses device

# echo "8086 10f5" > /sys/bus/pci/drivers/pci-stub/remove_id
# echo :00:19.0 > /sys/bus/pci/drivers_probe

Signed-off-by: Chris Wright 
---
 drivers/pci/pci-driver.c |   80 -
 1 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 93eac14..87a5ddb 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -99,6 +99,52 @@ store_new_id(struct device_driver *driver, const char *buf, 
size_t count)
 }
 static DRIVER_ATTR(new_id, S_IWUSR, NULL, store_new_id);
 
+/**
+ * store_remove_id - remove a PCI device ID from this driver
+ * @driver: target device driver
+ * @buf: buffer for scanning device ID data
+ * @count: input size
+ *
+ * Removes a dynamic pci device ID to this driver.
+ */
+static ssize_t
+store_remove_id(struct device_driver *driver, const char *buf, size_t count)
+{
+   struct pci_dynid *dynid, *n;
+   struct pci_driver *pdrv = to_pci_driver(driver);
+   __u32 vendor, device, subvendor = PCI_ANY_ID,
+   subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
+   int fields = 0;
+   int retval = -ENODEV;
+
+   fields = sscanf(buf, "%x %x %x %x %x %x",
+   &vendor, &device, &subvendor, &subdevice,
+   &class, &class_mask);
+   if (fields < 2)
+   return -EINVAL;
+
+   spin_lock(&pdrv->dynids.lock);
+   list_for_each_entry_safe(dynid, n, &pdrv->dynids.list, node) {
+   struct pci_device_id *id = &dynid->id;
+   if ((id->vendor == vendor) &&
+   (id->device == device) &&
+   (subvendor == PCI_ANY_ID || id->subvendor == subvendor) &&
+   (subdevice == PCI_ANY_ID || id->subdevice == subdevice) &&
+   !((id->class ^ class) & class_mask)) {
+   list_del(&dynid->node);
+   kfree(dynid);
+   retval = 0;
+   break;
+   }
+   }
+   spin_unlock(&pdrv->dynids.lock);
+
+   if (retval)
+   return retval;
+   return count;
+}
+static DRIVER_ATTR(remove_id, S_IWUSR, NULL, store_remove_id);
+
 static void
 pci_free_dynids(struct pci_driver *drv)
 {
@@ -125,6 +171,20 @@ static void pci_remove_newid_file(struct pci_driver *drv)
 {
driver_remove_file(&drv->driver, &driver_attr_new_id);
 }
+
+static int
+pci_create_removeid_file(struct pci_driver *drv)
+{
+   int error = 0;
+   if (drv->probe != NULL)
+   error = driver_create_file(&drv->driver,&driver_attr_remove_id);
+   return error;
+}
+
+static void pci_remove_removeid_file(struct pci_driver *drv)
+{
+   driver_remove_file(&drv->driver, &driver_attr_remove_id);
+}
 #else /* !CONFIG_HOTPLUG */
 static inline void pci_free_dynids(struct pci_driver *drv) {}
 static inline int pci_create_newid_file(struct pci_driver *drv)
@@ -132,6 +192,11 @@ static inline int pci_create_newid_file(struct pci_driver 
*drv)
return 0;
 }
 static inline void pci_remove_newid_file(struct pci_driver *drv) {}
+static inline int pci_create_removeid_file(struct pci_driver *drv)
+{
+   return 0;
+}
+static inline void pci_remove_removeid_file(struct pci_driver *drv) {}
 #endif
 
 /**
@@ -852,13 +917,23 @@ int __pci_register_driver(struct pci_driver *drv, struct 
module *owner,
/* register with core */
error = driver_register(&drv->driver);
if (error)
-   return error;
+   goto out;
 
error = pci_create_newid_file(drv);
if (error)
-   driver_unregister(&drv->driver);
+   goto out_newid;
 
+   error = pci_create_removeid_file(drv);
+   if (error)
+   goto out_removeid;
+out:
return error;
+
+out_removeid:
+   pci_remove_newid_file(drv);
+out_newid:
+   driver_unregister(&drv->driver);
+   goto out;
 }
 
 /**
@@ -874,6 +949,7 @@ int __pci_register_driver(struct pci_driver *drv, struct 
module *owner,
 void
 pci_unregister_driver(struct pci_driver *drv)
 {
+   pci_remove_removeid_file(drv);
pci_remove_newid_file(drv);
driver_unregister(&drv->driver);
pci_free_dynids(drv);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.o

Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Avi Kivity

Matteo Frigo wrote:

(On kvm-83, gcc complains about passing argument 1 of
'vmx_load_host_state' from incompatible pointer type.)
  


Yes, should be to_vmx(vcpu).  Same value anyway so it works.



Thanks for the terrific support.  kvm rocks!
  


Thanks for the debugging help.  kvm users rock!

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Matteo Frigo
Avi Kivity  writes:

> + vmx_load_host_state(vcpu);

Works here as well (on intel).  

(On kvm-83, gcc complains about passing argument 1 of
'vmx_load_host_state' from incompatible pointer type.)

Thanks for the terrific support.  kvm rocks!

Regards,
Matteo Frigo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Avi Kivity

Avi Kivity wrote:

Matteo Frigo wrote:

Matteo Frigo  writes:

 

Avi Kivity  writes:

   

Can you run the slightly modified gs.c (attached) and rerun on AMD?
The is to see if the runtime somehow restores gs.
  

Crashes as follows:

w2k3-64:~$ ./a.exe gs: 2b
gs:0x30: 7efdb000
Segmentation fault (core dumped)



A little bit more information:

w2k3-64:~$ gdb a.exe
GNU gdb 6.8.0.20080328-cvs (cygwin-special)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 


This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show 
copying"

and "show warranty" for details.
This GDB was configured as "i686-pc-cygwin"...
(no debugging symbols found)
(gdb) r
Starting program: /home/athena/a.exe [New thread 1620.0x6dc]
Error: dll starting at 0x77d4 not found.
Error: dll starting at 0x77d4 not found.
Error: dll starting at 0x77c2 not found.
Error while mapping shared library sections:
/cygdrive/c/WINDOWS/SysWOW64/ntdll32.dll: No such file or directory.
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
[New thread 1620.0x74c]
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)

Program received signal SIGSEGV, Segmentation fault.
0x0040109d in main ()
(gdb) x/i $pc
0x40109d :mov%gs:0x30,%esi
(gdb) p/x $gs
$1 = 0x2b
(gdb)   


Okay, at least this makes some little bit of sense.  On both Intel and 
AMD, 'mov gs' clobbers gs.base as expected.  On AMD, something further 
down the line (some syscall likely) restores gs.base, but on Intel it 
doesn't.  When we avoid the syscall, we get a crash on AMD as well.




The attached patch fixes it for me.  Without this, rdmsr(KERNEL_GS_BASE) 
reads a stale value, which presumably Windows later writes back.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

diff --git a/kernel/x86/vmx.c b/kernel/x86/vmx.c
index 7507ce2..048460d 100644
--- a/kernel/x86/vmx.c
+++ b/kernel/x86/vmx.c
@@ -910,6 +910,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
 		data = vmcs_readl(GUEST_SYSENTER_ESP);
 		break;
 	default:
+		vmx_load_host_state(vcpu);
 		msr = find_msr_entry(to_vmx(vcpu), msr_index);
 		if (msr) {
 			data = msr->data;


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Avi Kivity

Matteo Frigo wrote:

Matteo Frigo  writes:

  

Avi Kivity  writes:



Can you run the slightly modified gs.c (attached) and rerun on AMD?
The is to see if the runtime somehow restores gs.
  

Crashes as follows:

w2k3-64:~$ ./a.exe 
gs: 2b

gs:0x30: 7efdb000
Segmentation fault (core dumped)



A little bit more information:

w2k3-64:~$ gdb a.exe
GNU gdb 6.8.0.20080328-cvs (cygwin-special)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-pc-cygwin"...
(no debugging symbols found)
(gdb) r
Starting program: /home/athena/a.exe 
[New thread 1620.0x6dc]

Error: dll starting at 0x77d4 not found.
Error: dll starting at 0x77d4 not found.
Error: dll starting at 0x77c2 not found.
Error while mapping shared library sections:
/cygdrive/c/WINDOWS/SysWOW64/ntdll32.dll: No such file or directory.
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
[New thread 1620.0x74c]
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)

Program received signal SIGSEGV, Segmentation fault.
0x0040109d in main ()
(gdb) x/i $pc
0x40109d :   mov%gs:0x30,%esi
(gdb) p/x $gs
$1 = 0x2b
(gdb) 
  


Okay, at least this makes some little bit of sense.  On both Intel and 
AMD, 'mov gs' clobbers gs.base as expected.  On AMD, something further 
down the line (some syscall likely) restores gs.base, but on Intel it 
doesn't.  When we avoid the syscall, we get a crash on AMD as well.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Avi Kivity

Matteo Frigo wrote:

Matteo Frigo  writes:

  

Avi Kivity  writes:



Can you run the slightly modified gs.c (attached) and rerun on AMD?
The is to see if the runtime somehow restores gs.
  

Crashes as follows:

w2k3-64:~$ ./a.exe 
gs: 2b

gs:0x30: 7efdb000
Segmentation fault (core dumped)



A little bit more information:

w2k3-64:~$ gdb a.exe
GNU gdb 6.8.0.20080328-cvs (cygwin-special)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-pc-cygwin"...
(no debugging symbols found)
(gdb) r
Starting program: /home/athena/a.exe 
[New thread 1620.0x6dc]

Error: dll starting at 0x77d4 not found.
Error: dll starting at 0x77d4 not found.
Error: dll starting at 0x77c2 not found.
Error while mapping shared library sections:
/cygdrive/c/WINDOWS/SysWOW64/ntdll32.dll: No such file or directory.
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
[New thread 1620.0x74c]
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)

Program received signal SIGSEGV, Segmentation fault.
0x0040109d in main ()
(gdb) x/i $pc
0x40109d :   mov%gs:0x30,%esi
(gdb) p/x $gs
$1 = 0x2b
(gdb) 
  


Okay, at least this makes some little bit of sense.  On both Intel and 
AMD, 'mov gs' clobbers gs.base as expected.  On AMD, something further 
down the line (some syscall likely) restores gs.base, but on Intel it 
doesn't.  When we avoid the syscall, we get a crash on AMD as well.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: running windows xp created by vmware server on kvm/qemu

2009-02-13 Thread David S. Ahern


Martin Maurer wrote:
>> -Original Message-
>> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
>> Behalf Of David S. Ahern
>> Sent: Donnerstag, 12. Februar 2009 22:46
>> To: qemu-de...@nongnu.org; kvm-devel
>> Subject: running windows xp created by vmware server on kvm/qemu
>>
>> Has anyone successfully ported a Windows XP image from vmware server to
>> kvm/qemu?
> 
> See http://pve.proxmox.com/wiki/Migration_of_servers_to_Proxmox_VE
> 
> Br, Martin

That did the trick. Thanks for sharing.

david


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Matteo Frigo
Matteo Frigo  writes:

> Avi Kivity  writes:
>
>> Can you run the slightly modified gs.c (attached) and rerun on AMD?
>> The is to see if the runtime somehow restores gs.
>
> Crashes as follows:
>
> w2k3-64:~$ ./a.exe 
> gs: 2b
> gs:0x30: 7efdb000
> Segmentation fault (core dumped)

A little bit more information:

w2k3-64:~$ gdb a.exe
GNU gdb 6.8.0.20080328-cvs (cygwin-special)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-pc-cygwin"...
(no debugging symbols found)
(gdb) r
Starting program: /home/athena/a.exe 
[New thread 1620.0x6dc]
Error: dll starting at 0x77d4 not found.
Error: dll starting at 0x77d4 not found.
Error: dll starting at 0x77c2 not found.
Error while mapping shared library sections:
/cygdrive/c/WINDOWS/SysWOW64/ntdll32.dll: No such file or directory.
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
[New thread 1620.0x74c]
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)

Program received signal SIGSEGV, Segmentation fault.
0x0040109d in main ()
(gdb) x/i $pc
0x40109d : mov%gs:0x30,%esi
(gdb) p/x $gs
$1 = 0x2b
(gdb) 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Matteo Frigo
Avi Kivity  writes:

> Can you run the slightly modified gs.c (attached) and rerun on AMD?
> The is to see if the runtime somehow restores gs.

Crashes as follows:

w2k3-64:~$ ./a.exe 
gs: 2b
gs:0x30: 7efdb000
Segmentation fault (core dumped)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Avi Kivity

Matteo Frigo wrote:

Avi Kivity  writes:

  

More questions:
- is the bad 'mov gs' instruction reached on AMD?  or is it avoided
somehow?  What about bare metal?



The instruction is indeed reached on amd, and gs is 0x2b after
the instruction.  I don't know about bare metal.

  

- does the attached program fail when compiled and run in cygwin on an
AMD host?



The program runs as follows:

w2k3-64:/cygdrive/v$ gcc -O gs.c
w2k3-64:/cygdrive/v$ ./a.exe 
gs: 2b

gs:0x30: 7efdb000
test

  
- does setjmp()/longjmp() come from the Windows run-time library, or 
from cygwin?



The setjmp()/longjmp() is in the cygwin library /bin/cygwin1.dll .
bash calls longjmp() at the end of the expr built-in, which causes the
problem reported by the original poster.

I should also mention that, as an experiment, I have replaced the mov
gs,ax instruction with a couple of no-ops in cygwin1.dll, and cygwin
runs fine on Intel with this patch.

  


That was going to be my next question :)

Right now I don't understand how cygwin's longjmp() is supposed to work 
at all, given that it clobbers gs.


Can you run the slightly modified gs.c (attached) and rerun on AMD?  The 
is to see if the runtime somehow restores gs.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

#include 

int main()
{
unsigned short gs;
unsigned x;

asm ("mov %%gs, %0\n" : "=g"(gs));
asm ("movl %%gs:0x30, %0\n" : "=r"(x));

printf("gs: %x\n", gs);
printf("gs:0x30: %x\n", x);

asm volatile ("mov %0, %%gs\n" : : "g"(gs));

asm volatile ("movl %%gs:0x30, %0\n" : "=r"(x));

printf("gs:0x30: %x\n", x);

return 0;
}


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Avi Kivity

Matteo Frigo wrote:

Avi Kivity  writes:

  

More questions:
- is the bad 'mov gs' instruction reached on AMD?  or is it avoided
somehow?  What about bare metal?



The instruction is indeed reached on amd, and gs is 0x2b after
the instruction.  I don't know about bare metal.

  

- does the attached program fail when compiled and run in cygwin on an
AMD host?



The program runs as follows:

w2k3-64:/cygdrive/v$ gcc -O gs.c
w2k3-64:/cygdrive/v$ ./a.exe 
gs: 2b

gs:0x30: 7efdb000
test

  
- does setjmp()/longjmp() come from the Windows run-time library, or 
from cygwin?



The setjmp()/longjmp() is in the cygwin library /bin/cygwin1.dll .
bash calls longjmp() at the end of the expr built-in, which causes the
problem reported by the original poster.

I should also mention that, as an experiment, I have replaced the mov
gs,ax instruction with a couple of no-ops in cygwin1.dll, and cygwin
runs fine on Intel with this patch.

  


That was going to be my next question :)

Right now I don't understand how cygwin's longjmp() is supposed to work 
at all, given that it clobbers gs.


Can you run the slightly modified gs.c (attached) and rerun on AMD?  The 
is to see if the runtime somehow restores gs.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

#include 

int main()
{
unsigned short gs;
unsigned x;

asm ("mov %%gs, %0\n" : "=g"(gs));
asm ("movl %%gs:0x30, %0\n" : "=r"(x));

printf("gs: %x\n", gs);
printf("gs:0x30: %x\n", x);

asm volatile ("mov %0, %%gs\n" : : "g"(gs));

asm volatile ("movl %%gs:0x30, %0\n" : "=r"(x));

printf("gs:0x30: %x\n", x);

return 0;
}


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Matteo Frigo
Avi Kivity  writes:

> More questions:
> - is the bad 'mov gs' instruction reached on AMD?  or is it avoided
> somehow?  What about bare metal?

The instruction is indeed reached on amd, and gs is 0x2b after
the instruction.  I don't know about bare metal.

> - does the attached program fail when compiled and run in cygwin on an
> AMD host?

The program runs as follows:

w2k3-64:/cygdrive/v$ gcc -O gs.c
w2k3-64:/cygdrive/v$ ./a.exe 
gs: 2b
gs:0x30: 7efdb000
test

> - does setjmp()/longjmp() come from the Windows run-time library, or 
> from cygwin?

The setjmp()/longjmp() is in the cygwin library /bin/cygwin1.dll .
bash calls longjmp() at the end of the expr built-in, which causes the
problem reported by the original poster.

I should also mention that, as an experiment, I have replaced the mov
gs,ax instruction with a couple of no-ops in cygwin1.dll, and cygwin
runs fine on Intel with this patch.

Regards,
Matteo Frigo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Avi Kivity

Matteo Frigo wrote:

Avi Kivity  writes:

  

- run a cygwin program in an infinite loop (while :; do :; done is
sufficient)
- 'info registers', look at gdt
- 'x/28x 0x$GDT'

I'm interested in offset 0x28, but please provide the whole thing for
sanity checking.



Here it is.  This is with npt=1, but npt=0 shows exactly the
same GDT.

(qemu) info registers
EAX=0022dad8 EBX=0023 ECX=61108b28 EDX=0043fea7
ESI=0014 EDI= EBP=0022c518 ESP=0022c4f4
EIP=610935e2 EFL=0202 [---] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =002b   00c0f300
CS =0023   00c0fb00
SS =002b   00c0f300
DS =002b   00c0f300
FS =0053 7efdd000 0fff 0040f300
GS =002b 7efdb000  00c0f300
LDT=   
TR =0040 f84e8070 0068 8b00
GDT= f84e7000 006f
IDT= f84e7070 0fff
CR0=80050031 CR2=07ff7dd13000 CR3=2d62c000 CR4=06f8
DR0= DR1= DR2= DR3= 
DR6=0ff0 DR7=0400

FCW=037f FSW= [ST=0] FTW=00 MXCSR=
FPR0=  FPR1= 
FPR2=  FPR3= 
FPR4=  FPR5= 
FPR6=  FPR7= 
XMM00= XMM01=
XMM02= XMM03=
XMM04= XMM05=
XMM06=7ffe003000160014 XMM07=
(qemu) x/28x 0xf84e7000
f84e7000: 0x 0x 0x 0x
f84e7010: 0x 0x00209b00 0x 0x00cf9300
f84e7020: 0x 0x00cffb00 0x 0x00cff300
f84e7030: 0x 0x0020fb00 0x 0x
f84e7040: 0x80700068 0x8b4e 0xf800 0x
f84e7050: 0xdfff 0x7e40f3fd 0x 0x
f84e7060: 0x 0x00cf9b00 0x 0x
  


According to the GDT, gs:base will be zero after executing the 'mov gs' 
instruction (but gs:base is not zero prior, as seen in 'info 
registers').  This is the same info I get on Intel, so it seems the GDT 
is maintained correctly.


More questions:
- is the bad 'mov gs' instruction reached on AMD?  or is it avoided 
somehow?  What about bare metal?
- does the attached program fail when compiled and run in cygwin on an 
AMD host?
- does setjmp()/longjmp() come from the Windows run-time library, or 
from cygwin?


I note that ds, es, ss, and gs all contain the same selector, so I don't 
see how longjmp() can expect that gs.base will not be clobbered after 
executing 'mov gs'.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

#include 

int main()
{
unsigned short gs;
unsigned x;

asm ("mov %%gs, %0\n" : "=g"(gs));
asm ("movl %%gs:0x30, %0\n" : "=r"(x));

printf("gs: %x\n", gs);
printf("gs:0x30: %x\n", x);

asm ("mov %0, %%gs\n" : : "g"(gs));

printf("test\n");

asm ("movl %%gs:0x30, %0\n" : "=r"(x));

return 0;
}


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Matteo Frigo
Avi Kivity  writes:

> Can you rerun on AMD, but set add npt=0 as a kvm-amd module parameter?
> This will determine if this is an mmu bug or kvm-intel bug.

AMD works fine with npt=0, both cygwin and my setjmp()/longjmp() test
program.  dmesg confirms that nested paging is disabled.

Regards,
Matteo Frigo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Matteo Frigo
Avi Kivity  writes:

> - run a cygwin program in an infinite loop (while :; do :; done is
> sufficient)
> - 'info registers', look at gdt
> - 'x/28x 0x$GDT'
>
> I'm interested in offset 0x28, but please provide the whole thing for
> sanity checking.

Here it is.  This is with npt=1, but npt=0 shows exactly the
same GDT.

(qemu) info registers
EAX=0022dad8 EBX=0023 ECX=61108b28 EDX=0043fea7
ESI=0014 EDI= EBP=0022c518 ESP=0022c4f4
EIP=610935e2 EFL=0202 [---] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =002b   00c0f300
CS =0023   00c0fb00
SS =002b   00c0f300
DS =002b   00c0f300
FS =0053 7efdd000 0fff 0040f300
GS =002b 7efdb000  00c0f300
LDT=   
TR =0040 f84e8070 0068 8b00
GDT= f84e7000 006f
IDT= f84e7070 0fff
CR0=80050031 CR2=07ff7dd13000 CR3=2d62c000 CR4=06f8
DR0= DR1= DR2= 
DR3= 
DR6=0ff0 DR7=0400
FCW=037f FSW= [ST=0] FTW=00 MXCSR=
FPR0=  FPR1= 
FPR2=  FPR3= 
FPR4=  FPR5= 
FPR6=  FPR7= 
XMM00= XMM01=
XMM02= XMM03=
XMM04= XMM05=
XMM06=7ffe003000160014 XMM07=
(qemu) x/28x 0xf84e7000
f84e7000: 0x 0x 0x 0x
f84e7010: 0x 0x00209b00 0x 0x00cf9300
f84e7020: 0x 0x00cffb00 0x 0x00cff300
f84e7030: 0x 0x0020fb00 0x 0x
f84e7040: 0x80700068 0x8b4e 0xf800 0x
f84e7050: 0xdfff 0x7e40f3fd 0x 0x
f84e7060: 0x 0x00cf9b00 0x 0x
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nested kvm -> vmware esx

2009-02-13 Thread Alexander Graf

Does the first or second level guest panic?

Alex

On 13.02.2009, at 18:38, Jeffry Molanus   
wrote:



I have applied the patches you suggested and the system boots. The
system boots *very* slow and panics once it boots. I do not have the
dump with me right now but its something in sched.c

Jeffry

On Thu, Feb 12, 2009 at 6:25 PM, Alexander Graf  wrote:

Jeffry Molanus wrote:

Ps. I did try the -cpu switches ofcourse

Jeffry



Please try to use the -cpu phenom CPU and Nested Paging on an AMD  
CPU.
That's the only configuration I got things working at least a bit  
with.

If that work so far, apply the VMware backdoor patch I posted on the
qemu list, that exposes the TSC speed to ESX.

Using that configuration I was able to run ReactOS in the nested  
guest
pretty well. Linux got stuck somewhere in udev though. Don't try 64- 
bit

Linux yet.

Alex


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 3/4] qemu:virtio-net: Add support for qemu_vlan_rxfilter

2009-02-13 Thread Jamie Lokier
Paul Brook wrote:
> > > A simple "Something changed, please try your filter again" callback
> > > automatically covers all these cases.
> >
> > It doesn't, if tap has no memory of how many clients require a filter.
> 
> I'm talking about a callback for devices requesting an inbound
> filter. Devices implementing outgoing (device->vlan) filters just do
> as they're told.

So am I.

> > If tap just answers "YES and installs the kernel filter" or "NO and
> > doesn't install the kernel filter" and doesn't remember how many
> > clients need a filter, then:
> >...
> > In other words, tap needs to distinguish three states:
> >
> >      "1 filter requested and installed in the kernel"
> >      ">1 filter requested, none installed in the kernel"
> >      "0 filters requested, none installed in the kernel"
> 
> Absolutely not. This is the reason we have separate the "request an incoming 
> filter" API from the "provide an outgoing filter" callback. It allows the 
> vlan code to arbitrate in the middle.

I'm guessing that "vlan code to arbitrate in the middle" is exactly
what I've suggested.

Maybe the description was messed up.  Substitute something else for
"tap" in the above: "The _vlan arbitration code which talks to the tap
device_ needs to distinguish three states...".

> A vlan is a bus network, not a set of point-point connections.

Yes, this is what I've assumed.

The callback you suggest for devices requesting an inbound filter will
infinite-loop when there's two such devices on the same vlan bus,
because each time the callback is called, that device will re-issue
its filter request which triggers the callback on the other similar
device.  Back and forth.

To avoid the infinite loop, the vlan code in the middle (if that's
where you want it, and I agree) has to distinguish between no inbound
filters requested by attached devices, and multiple incompatible
inbound filters requested by attached devices.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Avi Kivity

Matteo Frigo wrote:

Avi Kivity  writes:

  

Can you rerun on AMD, but set add npt=0 as a kvm-amd module parameter?
This will determine if this is an mmu bug or kvm-intel bug.



AMD works fine with npt=0, both cygwin and my setjmp()/longjmp() test
program.  dmesg confirms that nested paging is disabled.
  


So it isn't the mmu.

I'm a a little far latency-wise from an AMD host.  Can you run this 
experiment on AMD (-monitor stdio will help):


- run a cygwin program in an infinite loop (while :; do :; done is 
sufficient)

- 'info registers', look at gdt
- 'x/28x 0x$GDT'

I'm interested in offset 0x28, but please provide the whole thing for 
sanity checking.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Avi Kivity

Matteo Frigo wrote:

Avi Kivity  writes:

  

Can you rerun on AMD, but set add npt=0 as a kvm-amd module parameter?
This will determine if this is an mmu bug or kvm-intel bug.



AMD works fine with npt=0, both cygwin and my setjmp()/longjmp() test
program.  dmesg confirms that nested paging is disabled.
  


So it isn't the mmu.

I'm a a little far latency-wise from an AMD host.  Can you run this 
experiment on AMD (-monitor stdio will help):


- run a cygwin program in an infinite loop (while :; do :; done is 
sufficient)

- 'info registers', look at gdt
- 'x/28x 0x$GDT'

I'm interested in offset 0x28, but please provide the whole thing for 
sanity checking.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm-userspace compile problem

2009-02-13 Thread Hans de Bruin

Glauber Costa wrote:

e$ ./configure --prefix=/usr  --with-patched-kernel
Install prefix/usr
BIOS directory/usr/share/qemu
binary directory  /usr/bin
Manual directory  /usr/share/man
ELF interp prefix /usr/gnemul/qemu-%M
Source path   /home/hns/kvm-userspace/qemu
C compilergcc
Host C compiler   gcc
ARCH_CFLAGS   -m64
make  make
install   install
host CPU  x86_64
host big endian   no
target list   x86_64-softmmu
gprof enabled no
sparse enabledno
profiler  no
static build  no
-Werror enabled   no
SDL support   yes
SDL static link   no
curses supportyes
mingw32 support   no
Audio drivers oss
Extra audio cards ac97 es1370 sb16
Mixer emulation   no
VNC TLS support   yes
   TLS CFLAGS
   TLS LIBS  -lgnutls
kqemu support no
kvm support   no - (#error Missing KVM capability
KVM_CAP_DESTROY_MEMORY_REGION_WORKS)


try passing --kerneldir=


That did not help, but after I changed this:

~/kvm-userspace$ diff configure.old configure
95,96c95,96
< #set kenel directory
< libkvm_kerneldir=$(readlink -f kernel)
---
> #set kernel directory
> libkvm_kerneldir=$(readlink -f $kerneldir)

the error was gone. configure used kvm.h in /usr/include/linux/ (distro 
suplied 2.6.27 headers) instead of the git version.


My troubles aren't totaly gone however:

make -C libkvm
make[1]: Entering directory `/home/hns/kvm-userspace/libkvm'
gcc -m64 -D__x86_64__ -MMD -MF ./.libkvm.d -g -fomit-frame-pointer -Wall 
 -fno-stack-protector   -

I /usr/src/linux-2.6/include   -c -o libkvm.o libkvm.c
In file included from /usr/include/bits/fcntl.h:24,
 from /usr/include/fcntl.h:34,
 from libkvm.c:30:
/usr/include/sys/types.h:46: error: conflicting types for 'loff_t'
/usr/src/linux-2.6/include/linux/types.h:57: error: previous declaration 
of 'loff_t' was here

/usr/include/sys/types.h:62: error: conflicting types for 'dev_t'
/usr/src/linux-2.6/include/linux/types.h:19: error: previous declaration 
of 'dev_t' was here

In file included from /usr/include/sys/types.h:133,
 from /usr/include/bits/fcntl.h:24,
 from /usr/include/fcntl.h:34,
 from libkvm.c:30:
/usr/include/time.h:105: error: conflicting types for 'timer_t'
/usr/src/linux-2.6/include/linux/types.h:28: error: previous declaration 
of 'timer_t' was here

In file included from /usr/include/bits/fcntl.h:24,
 from /usr/include/fcntl.h:34,
 from libkvm.c:30:
/usr/include/sys/types.h:198: error: conflicting types for 'int64_t'
/usr/src/linux-2.6/include/linux/types.h:125: error: previous 
declaration of 'int64_t' was here

/usr/include/sys/types.h:204: error: conflicting types for 'u_int64_t'
/usr/src/linux-2.6/include/linux/types.h:124: error: previous 
declaration of 'u_int64_t' was here

In file included from /usr/include/sys/types.h:220,
 from /usr/include/bits/fcntl.h:24,
 from /usr/include/fcntl.h:34,
 from libkvm.c:30:
/usr/include/sys/select.h:78: error: conflicting types for 'fd_set'
/usr/src/linux-2.6/include/linux/types.h:18: error: previous declaration 
of 'fd_set' was here

In file included from /usr/include/bits/fcntl.h:24,
 from /usr/include/fcntl.h:34,
 from libkvm.c:30:
/usr/include/sys/types.h:235: error: conflicting types for 'blkcnt_t'
/usr/src/linux-2.6/include/linux/types.h:146: error: previous 
declaration of 'blkcnt_t' was here

In file included from /usr/include/inttypes.h:28,
 from libkvm.c:37:
/usr/include/stdint.h:56: error: conflicting types for 'uint64_t'
/usr/src/linux-2.6/include/linux/types.h:123: error: previous 
declaration of 'uint64_t' was here

make[1]: *** [libkvm.o] Error 1
make[1]: Leaving directory `/home/hns/kvm-userspace/libkvm'
make: *** [libkvm] Error 2




--
Hans




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cygwin bash's built-in test command crashes on Windows 2008 Server 64bit under KVM

2009-02-13 Thread Avi Kivity

Matteo Frigo wrote:

Avi Kivity  writes:

  

- add a watchpoint to break when the value of gs:[0x30] changes



It seems that the problem can be reproduced by compiling the following
simple program using cygwin's gcc.  The program crashes on w2k3-amd64
on kvm-83 on core2-duo, and it does not crash on the same w2k3-amd64
installation on kvm-83 on AMD Phenom.

  #include 

  jmp_buf env;
  main()
  {
   if(setjmp(env)) return;
   longjmp(env, 1);
  }

The problem seems to be in the instruction ``mov gs,ax'' (Intel
syntax) in the longjmp() code.  If I let the virtual machine execute
the instruction, the program crashes.  However, if I step over the
instruction using the vs2008 debugger, the program completes without
crashing.  Thus, I think that this is the instruction that Avi is
looking for, but I don't know how to proceed from here.
  


I've decoded the global descriptor table for this, and I get:

(qemu) xp/14x 0x266b000
0266b000: 0x 0x 0x 0x
0266b010: 0x 0x00209b00 0x 0x00cf9300
0266b020: 0x 0x00cffb00 0x 0x00cff300
0266b030: 0x 0x0020fb00

The entry for gs is at offset 0x28, and has base set to 0, which is 
consistent with the crash.


Can you rerun on AMD, but set add npt=0 as a kvm-amd module parameter?  
This will determine if this is an mmu bug or kvm-intel bug.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PCI device assignment issues

2009-02-13 Thread Chris Wright
* Chris Wright (chr...@redhat.com) wrote:
> * Matthew Wilcox (matt...@wil.cx) wrote:
> > I might suggest a second approach which would be to have an explicit
> > echo to the bind file ignore the list of ids.  Then you wouldn't need to
> > 'echo -n "8086 10de"' to begin with.
> 
> I tried that first, and it dips into the driver logic, where it wants
> to filter via ->match.  Untested patch below _should_ be enough to avoid
> adding the id to begin with.

OK, after making it actually compile.  Still gets trapped into generic
logic, this time in pci core.  I'm starting to remember why dynid looked
like the better option.

pci_device_probe
  __pci_device_probe
pci_match_device()  <-- fails
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] use KVM_UPSTREAM for ppc.

2009-02-13 Thread Hollis Blanchard
On Fri, 2009-02-13 at 13:07 -0500, Glauber Costa wrote:
> ppc should compile with upstream qemu code, so, put these
> defines in ppc specific code that references kvm functions.
> 
> We don put them in config.h in this case, since there are
> files (like vl.c) that includes both kvm.h and qemu-kvm.h,
> and would break compilation.
> 
> Signed-off-by: Glauber Costa 

I actually haven't tried to build kvm-userspace for PPC since we got
upstream qemu working...

I'm a little confused though, because it doesn't look like even x86 is
using this?

-- 
Hollis Blanchard
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change

2009-02-13 Thread Chris Wright
* Jamie Lokier (ja...@shareable.org) wrote:
> no reason to believe kvm-83 is "stable", but there's no reason to
> believe any other version of KVM is especially stable either - there's
> no stabilising bug fix only branch that I'm aware of.

There's ad-hoc one w/out formal releases.  But...never been closer ;-)

http://thread.gmane.org/gmane.comp.emulators.kvm.devel/28179

thanks,
-chris
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PCI device assignment issues

2009-02-13 Thread Chris Wright
* Matthew Wilcox (matt...@wil.cx) wrote:
> On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> >   - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same 
> > bridge must be assigned to the same VT-d domain - i.e given device 
> > A (:0f:1.0) and device B (and :0f:2.0), if you assign 
> > device A to guest, you cannot then use device B in the host or 
> > another guest.
> 
> Is that a limitation of the VT-d / IOMMU setup?

Yes.  The source id will essentially show up as the bridge.

> >   - Some newer PCIe devices (and newer conventional PCI devices too via 
> > PCI Advanced Features) support Function Level Reset (FLR). This 
> > allows a PCI function to be reset without affecting any other 
> > functions on that device, or any other devices. This feature is not 
> > widespread yet AFAIK - e.g. I've seen it on an audio controller, 
> > and it must also be supported by SR-IOV devices.
> 
> Yes, that's definitely not very widespread yet.  OTOH, we don't need to
> worry about disturbing other functions if all devices behind the same
> bridge have to be mapped to the same guest.

FLR (when it exists) would work fine for devices not behind a conventional
pci bridge.

> > Driver Unbinding
> > 
> > 
> > Before a device is assigned to a guest, we should make sure that no host
> > device driver is currently bound to the device.
> > 
> > We can do that with e.g.
> > 
> >  $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
> >  $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
> >  $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
> > 
> > One minor problem with this scheme is that at this point you can't
> > unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
> > In order to support that, we need a "remove_id" interface to remove the
> > dynamic ID.
> 
> It sounds like you'd be OK with a 'remove_id' interface that only
> removes subsequently-added interfaces.
> 
> I might suggest a second approach which would be to have an explicit
> echo to the bind file ignore the list of ids.  Then you wouldn't need to
> 'echo -n "8086 10de"' to begin with.

I tried that first, and it dips into the driver logic, where it wants
to filter via ->match.  Untested patch below _should_ be enough to avoid
adding the id to begin with.

> > Furthermore, it should be possible to do this without actually affecting
> > any of the devices - i.e. a "try to unbind and see if we oops" approach
> > clearly isn't great.
> 
> Well, yes.  I'd even be upset if my network or storage flickered away
> briefly while another using was starting to run KVM.
> 
> > This last constraint is the most difficult and points to the logic
> > needing to be in userland management libraries. Possibly the only sane
> > kernel space support would be "try to unbind and reset; if it works then
> > the device is assignable".
> 
> If we expose a 'reset' file in the /sys/bus/pci/devices/*/ directories
> for devices that are resettable, that should be enough, I would think.

Yes, currently it's only internal and it's not robust given the reset
constraints.

thanks,
-chris
--

 drivers/base/base.h |1 +
 drivers/base/bus.c  |2 +-
 drivers/base/dd.c   |   15 +--
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 0a5f055..60dc346 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -86,6 +86,7 @@ extern void bus_remove_driver(struct device_driver *drv);
 
 extern void driver_detach(struct device_driver *drv);
 extern int driver_probe_device(struct device_driver *drv, struct device *dev);
+extern int driver_bind_probe_device(struct device_driver *drv, struct device 
*dev);
 
 extern void sysdev_shutdown(void);
 extern int sysdev_suspend(pm_message_t state);
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 83f32b8..ad28338 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -202,7 +202,7 @@ static ssize_t driver_bind(struct device_driver *drv,
if (dev->parent)/* Needed for USB */
down(&dev->parent->sem);
down(&dev->sem);
-   err = driver_probe_device(drv, dev);
+   err = driver_bind_probe_device(drv, dev);
up(&dev->sem);
if (dev->parent)
up(&dev->parent->sem);
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 315bed8..fba6463 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -184,13 +184,14 @@ int driver_probe_done(void)
  * This function must be called with @dev->sem held.  When called for a
  * USB interface, @dev->parent->sem must be held as well.
  */
-int driver_probe_device(struct device_driver *drv, struct device *dev)
+static int __driver_probe_device(struct device_driver *drv, struct device *dev,
+bool force)
 {
int ret = 0;
 
  

[PATCH] use KVM_UPSTREAM for ppc.

2009-02-13 Thread Glauber Costa
ppc should compile with upstream qemu code, so, put these
defines in ppc specific code that references kvm functions.

We don put them in config.h in this case, since there are
files (like vl.c) that includes both kvm.h and qemu-kvm.h,
and would break compilation.

Signed-off-by: Glauber Costa 
---
 qemu/hw/ppc440.c |1 +
 qemu/hw/ppc440_bamboo.c  |1 +
 qemu/target-ppc/helper.c |1 +
 3 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/qemu/hw/ppc440.c b/qemu/hw/ppc440.c
index 00d82e4..164c326 100644
--- a/qemu/hw/ppc440.c
+++ b/qemu/hw/ppc440.c
@@ -18,6 +18,7 @@
 #include "ppc440.h"
 #include "ppc405.h"
 #include "sysemu.h"
+#define KVM_UPSTREAM
 #include "kvm.h"
 
 #define PPC440EP_PCI_CONFIG 0xeec0
diff --git a/qemu/hw/ppc440_bamboo.c b/qemu/hw/ppc440_bamboo.c
index fbd447c..60ddaf4 100644
--- a/qemu/hw/ppc440_bamboo.c
+++ b/qemu/hw/ppc440_bamboo.c
@@ -21,6 +21,7 @@
 #include "boards.h"
 #include "sysemu.h"
 #include "ppc440.h"
+#define KVM_UPSTREAM
 #include "kvm.h"
 #include "kvm_ppc.h"
 #include "device_tree.h"
diff --git a/qemu/target-ppc/helper.c b/qemu/target-ppc/helper.c
index fc62a63..d49acaf 100644
--- a/qemu/target-ppc/helper.c
+++ b/qemu/target-ppc/helper.c
@@ -29,6 +29,7 @@
 #include "exec-all.h"
 #include "helper_regs.h"
 #include "qemu-common.h"
+#define KVM_UPSTREAM
 #include "kvm.h"
 
 //#define DEBUG_MMU
-- 
1.5.6.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 1/7] PCI: initialize and release SR-IOV capability

2009-02-13 Thread Matthew Wilcox
On Fri, Feb 13, 2009 at 05:56:44PM +0100, Andi Kleen wrote:
> > +   pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
> > +   if (ctrl & PCI_SRIOV_CTRL_VFE) {
> > +   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
> > +   msleep(100);
> 
> That's really long. Hopefully that's really needed.

Yes and no.  The spec says:

  To allow components to perform internal initialization, system software
  must wait for at least 100 ms after changing the VF Enable bit from
  a 0 to a 1, before it is permitted to issue Configuration Requests to
  the VFs which are enabled by that VF Enable bit.

So we don't have to wait here, but we do have to wait before exposing
all these virtual functions to the rest of the system.  Should we add
more complexity, perhaps spawn a thread to do it asynchronously, or add
0.1 seconds to device initialisation?  A question without an easy
answer, iMO.

> > +
> > +   pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
> > +   pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
> > +   pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
> > +   pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
> > +   if (!offset || (total > 1 && !stride))
> > +   return -EIO;
> > +
> > +   pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz);
> > +   i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
> > +   pgsz &= ~((1 << i) - 1);
> > +   if (!pgsz)
> > +   return -EIO;
> 
> All the error paths don't seem to undo the config space writes.
> How will the devices behave with half initialized context?

I think we should clear the VF_ENABLE bit.  That action is also fraught
with danger:

  If software Clears VF Enable, software must allow 1 second after VF
  Enable is Cleared before reading any field in the SR-IOV Extended
  Capability or the VF Migration State Array (see Section 3.3.15.1).

Another msleep(1000) here?  Not pretty, but what else can we do?

Not to mention the danger of something else innocently using lspci -
to read a field in the extended capability -- I suspect we also need to
block user config accesses before clearing this bit.

-- 
Matthew Wilcox  Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nested kvm -> vmware esx

2009-02-13 Thread Jeffry Molanus
I have applied the patches you suggested and the system boots. The
system boots *very* slow and panics once it boots. I do not have the
dump with me right now but its something in sched.c

Jeffry

On Thu, Feb 12, 2009 at 6:25 PM, Alexander Graf  wrote:
> Jeffry Molanus wrote:
>> Ps. I did try the -cpu switches ofcourse
>>
>> Jeffry
>
>
> Please try to use the -cpu phenom CPU and Nested Paging on an AMD CPU.
> That's the only configuration I got things working at least a bit with.
> If that work so far, apply the VMware backdoor patch I posted on the
> qemu list, that exposes the TSC speed to ESX.
>
> Using that configuration I was able to run ReactOS in the nested guest
> pretty well. Linux got stuck somewhere in udev though. Don't try 64-bit
> Linux yet.
>
> Alex
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PCI device assignment issues

2009-02-13 Thread Matthew Wilcox
On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> Hi,

You raise some interesting points.  Thanks for doing that rather than
going off and creating a big pile of patches and demanding they be
applied ;-)

> This gets confusing, so some background constraints first:
> 
>   - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same 
> bridge must be assigned to the same VT-d domain - i.e given device 
> A (:0f:1.0) and device B (and :0f:2.0), if you assign 
> device A to guest, you cannot then use device B in the host or 
> another guest.

Is that a limitation of the VT-d / IOMMU setup?

>   - Some newer PCIe devices (and newer conventional PCI devices too via 
> PCI Advanced Features) support Function Level Reset (FLR). This 
> allows a PCI function to be reset without affecting any other 
> functions on that device, or any other devices. This feature is not 
> widespread yet AFAIK - e.g. I've seen it on an audio controller, 
> and it must also be supported by SR-IOV devices.

Yes, that's definitely not very widespread yet.  OTOH, we don't need to
worry about disturbing other functions if all devices behind the same
bridge have to be mapped to the same guest.

>   - Secondary Bus Reset (SBR) allows software to trigger a reset on all 
> devices (and functions) behind a PCI bridge.
> 
>   - A PCI Power Management D-state transition (D3hot to D0) can be used 
> to reset a device (all functions).

That's not guaranteed according to PCI PM 1.2:

5.4.1. Software Accessible D3 (D3hot)

  When programmed to D0, the function may return to the D0 Initialized
  or D0 Uninitialized state without PCI RST# being asserted. This option
  is determined at design time and allows designs the option of either
  performing an internal reset or not performing an internal reset.

-

There's also the option that devices in a hotplug PCI slot can have
their power cycled, forcing them into D3cold and then transitioning into
D0 Uninitialised.

>   - Some PCI devices don't have page aligned MMIO BARs. These devices 
> (all functions) cannot be safely assigned to guests.

We've seen patches to force page alignment on this list ... they haven't
been sufficiently beautiful to be applied yet.

> Driver Unbinding
> 
> 
> Before a device is assigned to a guest, we should make sure that no host
> device driver is currently bound to the device.
> 
> We can do that with e.g.
> 
>  $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
>  $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
>  $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
> 
> One minor problem with this scheme is that at this point you can't
> unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
> In order to support that, we need a "remove_id" interface to remove the
> dynamic ID.

It sounds like you'd be OK with a 'remove_id' interface that only
removes subsequently-added interfaces.

I might suggest a second approach which would be to have an explicit
echo to the bind file ignore the list of ids.  Then you wouldn't need to
'echo -n "8086 10de"' to begin with.

> Device Reset
> 
> 
> Before assigning a device to a guest, it should be reset. The host or a
> previous guest may have left the device in an unknown state. Not
> resetting can be seen in testing to lead to e.g. "TX Unit Hang" errors
> with e1000e devices.

Really, this is the same problem that kexec has.  Either the driver is
doing insufficient initialisation, or it's not doing tis shutdown
properly.  The former is definitely better than the latter as kexec may
be used from a position of having the driver locked solid and unable to
reset the device.

> If we're assigning devices from behind a PCI/PCI-x bridge (remember all
> devices must be assigned together), then we can use SBR to reset them
> all together. Clearly, though, one should make sure that all devices
> behind that bridge are not in use before doing the reset. We could
> implement this with a "reset" sysfs interface for pci-stub - it would
> only reset a device using SBR if all devices behind that bridge were
> bound to pci-stub.

I don't think this should be part of pci-stub, but rather part of the
PCI core.  I can imagine other uses for being able to reset all devices
behind a bridge that don't involve anything to do with v12n.  So I'd
like to see a /sys/class/pci_bus/*/reset (where * would not include root
busses).

> Where a conventional PCI device is on the root bus, or where a PCIe
> device is on the root bus or another bus with multiple devices, we could
> use the D-state transition reset. Since this resets all functions on a
> device, we would need a similar approach where all functions must be
> bound to pci-stub before being reset.

Even with the caveat above about D0 -> D3hot -> D0 doesn't necessarily
do a full reset, it does seem to be per-function.  For example, this
passage fro

Re: [PATCH 3/3] KVM: Enable MSI-X for KVM assigned device

2009-02-13 Thread Marcelo Tosatti
On Fri, Feb 13, 2009 at 01:29:42PM +0800, Sheng Yang wrote:
> The allocation only happen once, at the second time it would report error in 
> current code. But allocate/deallocate is also acceptable for future.

Oops, it OK. 

> > - interrupt context can read the table while kvm_vm_ioctl_set_msix_entry
> >   is modifying it. So you either need to forbid more than one
> >   kvm_vm_ioctl_set_msix_entry call in the lifetime of a guest (which
> >   you can later allow when you support MSI table change), or handle
> >   accesses from multiple contexes now. It seems forbidding is enough for
> >   the moment from what you said.
> 
> Yeah.
> 
> But for the modifying the MSI-X table, the most critical problem is, current 
> Linux didn't support it IIRC. So I have to disable MSI-X then enable it again 
> with new table, and it would result in lost interrupt.
> 
> So seems the most reasonable method is to modify pci_enable_msix() and 
> related 
> function's action to support this...

Alright, then just reply "[PATCH 1/3] KVM: Ioctls for init MSI-X entry"
with the minor comments reworked so Avi can apply.

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] KVM: Add gsi_msg_pending_bitmap for MSI-X

2009-02-13 Thread Marcelo Tosatti
On Fri, Feb 13, 2009 at 11:37:45AM +0800, Sheng Yang wrote:
> > > +#define KVM_ASSIGNED_DEV_MSIX((1 << 2) | (1 << 10))
> >
> > Can you explain the usage of the two bits?
> 
> Um... Just to keep consistent with formers(one for guest and one for host), 
> at 
> cost of one bit.

OK

> > Can drop the printk's (also from find_gsi_from_host_irq).
> 
> Not that confident. In fact, I often triggered this during debug... IIRC, 
> userspace program shouldn't trigger this if kernel space works well. Maybe it 
> can be changed to WARN_ON() or BUG_ON() later.

OK

> > Check the return value?
> 
> Yeah...
> >
> > > + enable_irq(irq);
> >
> > Do you guarantee that particular irq you're enable_irq'ing is not bogus?
> > Its has been passed from userspace after all.
> 
> It isn't passed from userspace. This one is filled by pci_enable_msix(), 
> which 
> should be OK. 

Alright.

> > So you chose GSI == 0 as invalid because of x86 assumptions? Or is there
> > any other reason?
> 
> Yeah, it based on x86 and IA64 IRQ 0 can't be used by MSI-X. And only x86 
> support MSI-X now(and IA64 would follow later).

OK.

> >
> > IRQ sharing in the host side is not supported correct?
> 
> Um? Yeah... And seems we won't support it forever...

OK.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PCI device assignment issues

2009-02-13 Thread Mark McLoughlin
On Fri, 2009-02-13 at 08:56 -0800, Greg KH wrote:
> On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> > Driver Unbinding
> > 
> > 
> > Before a device is assigned to a guest, we should make sure that no host
> > device driver is currently bound to the device.
> > 
> > We can do that with e.g.
> > 
> >  $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
> >  $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
> >  $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
> > 
> > One minor problem with this scheme is that at this point you can't
> > unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
> 
> Are you sure?  It should work if you manually tell the e1000e driver to
> bind to it, after unbinding it from the pci-stub driver.

Yes, that works - I meant using /sys/bus/pci/drivers_probe. The problem
is that it would suck for management tools to have to remember which
device driver it was originally bound to.

> > In order to support that, we need a "remove_id" interface to remove the
> > dynamic ID.
> 
> Why?

Before assignment:

 $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
 $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
 $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
 $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/remove_id

After assignment:

 $> echo -n :00:19.0 > /sys/bus/pci/drivers_probe

> > What we don't support is a way to unbind permanently. Xen has a
> > pciback.hide module param which tries to achieve this, but you end up
> > with the inevitable issues around making sure pciback is loaded before
> > the device driver etc.
> 
> What do you mean, unbind "permanently"?  For every reboot?  Or just
> within the same boot time?

Across reboots, yeah.

Cheers,
Mark.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 3/4] qemu:virtio-net: Add support for qemu_vlan_rxfilter

2009-02-13 Thread Paul Brook
> > A simple "Something changed, please try your filter again" callback
> > automatically covers all these cases.
>
> It doesn't, if tap has no memory of how many clients require a filter.

I'm talking about a callback for devices requesting an inbound filter. Devices 
implementing outgoing (device->vlan) filters just do as they're told.

> If tap just answers "YES and installs the kernel filter" or "NO and
> doesn't install the kernel filter" and doesn't remember how many
> clients need a filter, then:
>...
> In other words, tap needs to distinguish three states:
>
>      "1 filter requested and installed in the kernel"
>      ">1 filter requested, none installed in the kernel"
>      "0 filters requested, none installed in the kernel"

Absolutely not. This is the reason we have separate the "request an incoming 
filter" API from the "provide an outgoing filter" callback. It allows the 
vlan code to arbitrate in the middle. A vlan is a bus network, not a set of 
point-point connections. I haven't checked whether the proposed patch gets 
this right. I suspect it probably doesn't.

This is why the initial patch that had clients talking to each other directly 
was completely wrong.

Paul
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PCI device assignment issues

2009-02-13 Thread Greg KH
On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> Driver Unbinding
> 
> 
> Before a device is assigned to a guest, we should make sure that no host
> device driver is currently bound to the device.
> 
> We can do that with e.g.
> 
>  $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
>  $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
>  $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
> 
> One minor problem with this scheme is that at this point you can't
> unbind from pci-stub and trigger a re-probe and have e1000e bind to it.

Are you sure?  It should work if you manually tell the e1000e driver to
bind to it, after unbinding it from the pci-stub driver.

> In order to support that, we need a "remove_id" interface to remove the
> dynamic ID.

Why?

> What we don't support is a way to unbind permanently. Xen has a
> pciback.hide module param which tries to achieve this, but you end up
> with the inevitable issues around making sure pciback is loaded before
> the device driver etc.

What do you mean, unbind "permanently"?  For every reboot?  Or just
within the same boot time?

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 1/7] PCI: initialize and release SR-IOV capability

2009-02-13 Thread Andi Kleen
Yu Zhao  writes:
> +
> +
> +static int sriov_init(struct pci_dev *dev, int pos)
> +{
> + int i;
> + int rc;
> + int nres;
> + u32 pgsz;
> + u16 ctrl, total, offset, stride;
> + struct pci_sriov *iov;
> + struct resource *res;
> + struct pci_dev *pdev;
> +
> + if (dev->pcie_type != PCI_EXP_TYPE_RC_END &&
> + dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)
> + return -ENODEV;
> +

It would be a good idea to put a might_sleep() here just in 
case the msleep happens below and drivers call it incorrectly.

> + pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
> + if (ctrl & PCI_SRIOV_CTRL_VFE) {
> + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
> + msleep(100);


That's really long. Hopefully that's really needed.

> +
> + pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
> + pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
> + pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
> + pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
> + if (!offset || (total > 1 && !stride))
> + return -EIO;
> +
> + pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz);
> + i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
> + pgsz &= ~((1 << i) - 1);
> + if (!pgsz)
> + return -EIO;

All the error paths don't seem to undo the config space writes.
How will the devices behave with half initialized context?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 3/4] qemu:virtio-net: Add support for qemu_vlan_rxfilter

2009-02-13 Thread Jamie Lokier
Paul Brook wrote:
> > Well, you do need some way to notify a client that the filter which
> > used to work has been removed because it's no longer available, for
> > example when migrating between host kernels.
> 
> This is actually more evidence that the "add" and "remove" callbacks are 
> bogus.

I agree, "add" and "remove" are misnamed among other bogosities.

> My point is that we're allowing the filter to fail for arbitrary
> reasons. Once you assume this, trying to be clever is just going to
> catch you out when we encounter a case you didn't think of.

Yes.

> A simple "Something changed, please try your filter again" callback 
> automatically covers all these cases.

It doesn't, if tap has no memory of how many clients require a filter.

If tap just answers "YES and installs the kernel filter" or "NO and
doesn't install the kernel filter" and doesn't remember how many
clients need a filter, then:

1. Client A requests filter A
 -> tap says YES, installs kernel filter.
2. Client B requests filter B
 -> tap can't do both, so deinstalls kernel filter.
 -> tap says NO to client B.
 -> tap notifies client A "there's been a change".
3. Client A requests filter A
 -> tap doesn't remember that client B wants a filter, so
 -> tap says YES.
 -> tap notifies client B "there's been a change".
4. Goto 2
 -> endless loop.

In other words, tap needs to distinguish three states:

 "1 filter requested and installed in the kernel"
 ">1 filter requested, none installed in the kernel"
 "0 filters requested, none installed in the kernel"

The interface to clients needs to keep that state updated in tap
somehow.  Just requesting a filter and reverting to software filtering
when the request failed doesn't do that.

Note this has nothing to do with the software filtering fallback.
Even if qemu internal filtering is always done, you still need the
above to keep the kernel filter correct.

Imho, since there needs to be software filter code for fallback
_anyway_, a better place to put that fallback is in the
client-independent plumbing which relays packets to each client.  And
if it's there, you can have a different software fallback filter for
each client anyway, so clients can assume filters are always installed
and working perfectly.

The interface between the plumbing and the tap driver still needs to
handle the issues described above.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM PCI device assignment issues

2009-02-13 Thread Mark McLoughlin
Hi,

KVM has support for PCI device assignment using VT-d and AMD IOMMU, but
there are a number of inter-related issues that need some further
discussion:

  - Unbinding devices from any existing device driver before assignment

  - Resetting devices before and after assignment

  - Helping users figure out which devices can actually be assigned

This gets confusing, so some background constraints first:

  - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same 
bridge must be assigned to the same VT-d domain - i.e given device 
A (:0f:1.0) and device B (and :0f:2.0), if you assign 
device A to guest, you cannot then use device B in the host or 
another guest.

  - Some newer PCIe devices (and newer conventional PCI devices too via 
PCI Advanced Features) support Function Level Reset (FLR). This 
allows a PCI function to be reset without affecting any other 
functions on that device, or any other devices. This feature is not 
widespread yet AFAIK - e.g. I've seen it on an audio controller, 
and it must also be supported by SR-IOV devices.

  - Secondary Bus Reset (SBR) allows software to trigger a reset on all 
devices (and functions) behind a PCI bridge.

  - A PCI Power Management D-state transition (D3hot to D0) can be used 
to reset a device (all functions).

  - Some PCI devices don't have page aligned MMIO BARs. These devices 
(all functions) cannot be safely assigned to guests.

Driver Unbinding


Before a device is assigned to a guest, we should make sure that no host
device driver is currently bound to the device.

We can do that with e.g.

 $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
 $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
 $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind

One minor problem with this scheme is that at this point you can't
unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
In order to support that, we need a "remove_id" interface to remove the
dynamic ID.

What we don't support is a way to unbind permanently. Xen has a
pciback.hide module param which tries to achieve this, but you end up
with the inevitable issues around making sure pciback is loaded before
the device driver etc.

Permanent unbinding isn't necessarily needed, but it might help provide
a solution to some of the nastier issues below.

Device Reset


Before assigning a device to a guest, it should be reset. The host or a
previous guest may have left the device in an unknown state. Not
resetting can be seen in testing to lead to e.g. "TX Unit Hang" errors
with e1000e devices.

FLR is without doubt the preferable solution here. KVM already
implements this. However, the range of devices which support FLR is
currently quite limited.

If we're assigning devices from behind a PCI/PCI-x bridge (remember all
devices must be assigned together), then we can use SBR to reset them
all together. Clearly, though, one should make sure that all devices
behind that bridge are not in use before doing the reset. We could
implement this with a "reset" sysfs interface for pci-stub - it would
only reset a device using SBR if all devices behind that bridge were
bound to pci-stub.

Where a conventional PCI device is on the root bus, or where a PCIe
device is on the root bus or another bus with multiple devices, we could
use the D-state transition reset. Since this resets all functions on a
device, we would need a similar approach where all functions must be
bound to pci-stub before being reset.

Furthermore, we would need to prevent pci-stub from resetting a device
it is bound to where the device is already assigned to a guest. To
achieve this, we would want KVM to explicitly call in to pci-stub to
mark a device as in use.

The alternatives to such an approach are:

  a) Only support FLR capable devices

  b) Cross our fingers and hope that work without a device reset

  c) Allow a driver to be permanently unbound from a device and require 
 the user to reboot after unbinding before assigning

Filtering
=

In order to support a sane user interface in management tools, it should
be possible to list all PCI devices on available on a host and filter
out those which cannot be assigned to a guest.

Furthermore, it should be possible to do this without actually affecting
any of the devices - i.e. a "try to unbind and see if we oops" approach
clearly isn't great.

Finally, some management tools would like to be able to do this
filtering given the constraint of a device being reserved for a
currently inactive guest.

This last constraint is the most difficult and points to the logic
needing to be in userland management libraries. Possibly the only sane
kernel space support would be "try to unbind and reset; if it works then
the device is assignable".

Conclusions
===

Only supporting devices with FLR restricts our user pool far too
severely.

Permanent unbind

Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change

2009-02-13 Thread Jamie Lokier
Marc Bevand schrieb:
> I tested kvm-81 and kvm-83 as well (can't test kvm-80 or older
> because of the qcow2 performance regression caused by the default
> writethrough caching policy) but it randomly triggers an even worse
> bug: the moment I shut down a guest by typing "quit" in the monitor,
> it sometimes overwrite the first 4kB of the disk image with mostly
> NUL bytes (!) which completely destroys it. I am familiar with the
> qcow2 format and apparently this 4kB block seems to be an L2 table
> with most entries set to zero. I have had to restore at least 6 or 7
> disk images from backup after occurences of that bug.

Ow!  That's a really serious bug.  How many of us have regular hourly
backups of our disk images?  And how many of us are running databases
or mail servers on our VMs, where even restoring from a recent backup
is a harmful event?

I've not noticed this bug reported by Marc, probably because I nearly
always finish a KVM session by killing it, either because I'm testing
or because KVM locks up occasionally and needs kill -9 :-(

And because I've not used any KVM since kvm-72 in production until
recently, only for testing my personal VMs.

I must say, _thank goodness_ that the bug I reported occurs at boot
time, and caused me to revert the qcow2 code.  I'm now running a
crticial VM on kvm-83 with reverted qcow2.  Sure it's risky as there's
no reason to believe kvm-83 is "stable", but there's no reason to
believe any other version of KVM is especially stable either - there's
no stabilising bug fix only branch that I'm aware of.

If I hadn't had the boot time bug which I reported, I could have
unrecoverable corruption instead from Marc's bug.

For the time being, I'm going to _strongly_ advise my VM using
professional clients to never, *ever* use qcow2 except for snapshot
testing.

Unfortunately the other delta/growable formats seem to be even less
reliable, because they're not used much, so they should be avoided too.

This corruption plus the data integrity/durability issues on host
failure are a big deal.  Even with kvm-72, I'm nervous about qcow2 now.
Just because a bug hasn't caused obvious guest failures, doesn't mean
it's not happening.

Is there a way to restructure the code and/or how it works so it's
more clearly correct?

> My intuition tells me this may be the qcow2 code trying to allocate
> a cluster to write a new L2 table, but not noticing the allocation
> failed (represented by a 0 offset), and writing the L2 table at that
> 0 offset, overwriting the qcow2 header.

My intuition says it's important to identify the cause of this, as it
might not be qcow2 but the AIO code going awry with a random offset
when closing down, e.g. if there's a use-after-free bug.

Marc..  this is quite a serious bug you've reported.  Is there a
reason you didn't report it earlier?

-- Jamie 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 3/4] qemu:virtio-net: Add support for qemu_vlan_rxfilter

2009-02-13 Thread Paul Brook
On Friday 13 February 2009, Jamie Lokier wrote:
> Paul Brook wrote:
> > > We could use a changed() function, but it would need to know the
> > > direction of the change, which leads back to the same mechanics.  If
> > > there's a better way, please suggest it.  Thanks,
> >
> > I still don't see why the device needs to know what's changed. The
> > response should always be the same: Request a filter, and see if it
> > works.
>
> Well, you do need some way to notify a client that the filter which
> used to work has been removed because it's no longer available, for
> example when migrating between host kernels.

This is actually more evidence that the "add" and "remove" callbacks are 
bogus.

My point is that we're allowing the filter to fail for arbitrary reasons. Once 
you assume this, trying to be clever is just going to catch you out when we 
encounter a case you didn't think of.

A simple "Something changed, please try your filter again" callback 
automatically covers all these cases.

It may be that in some cases qemu already knows the filter is going to fail. 
However these events are rare (i.e. not performance critical) so it's far 
simpler to just use the same callback and have the device try anyway.

Paul
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Running KVM on a Laptop

2009-02-13 Thread Jorge Lucángeli Obes
On Thu, Feb 12, 2009 at 11:56 PM, Ross McKay  wrote:
> dnjap wrote:
>
>>I'm looking for a laptop on which I can run KVM.
>>
>>1. Does anyone have a list of AMD-V or VT-x capable laptop CPU's?

http://en.wikipedia.org/wiki/List_of_Intel_Core_2_microprocessors

The header before each chart lists which processors have VT-x.
Something similar must be available for the laptop version of Intel's
Nehalem and for AMD.

> Is it not as simple as checking for the svm or vt flags?
>
>egrep '(vt|svm)' /proc/cpuinfo
>
> Take a bootable LiveCD into your local computer shop and check.
>
>>2. It seems Turion 64 X2 and some of Core 2 Duo processors can run KVM.
>>Does anyone know which of these CPU's implement Nested Page Tables or
>>Extended Page Tables? (Xen's compatibility list doesn't provide such
>>details: http://wiki.xensource.com/xenwiki/HVM_Compatible_Processors)

You will probably have to verify that in the manufacturer's web site.
I recall a discussion not long ago regarding the addition of NPT and
EPT to /proc/cpuinfo.

>>3. If you're running KVM on your laptop, could you share the
>>information? Does your laptop's BIOS support AMD-V or VT-x?
>
> FWIW, I have KVM running on my ASUS laptop, with the following CPU:
>
>AMD Athlon(tm) 64 X2 Dual-Core Processor TK-55

I have a Dell XPS 1210 with:

Intel(R) Core(TM)2 CPU T5600 @ 1.83GHz

Regards,
Jorge
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 3/4] qemu:virtio-net: Add support for qemu_vlan_rxfilter

2009-02-13 Thread Jamie Lokier
Paul Brook wrote:
> > We could use a changed() function, but it would need to know the
> > direction of the change, which leads back to the same mechanics.  If
> > there's a better way, please suggest it.  Thanks,
> 
> I still don't see why the device needs to know what's changed. The response 
> should always be the same: Request a filter, and see if it works.

Well, you do need some way to notify a client that the filter which
used to work has been removed because it's no longer available, for
example when migrating between host kernels.

Or implement reliable filtering in the infrastructure which relays
packets internally in QEMU, so that each client can request a filter
and it always works, and the tap driver can tell the QEMU
infrastructure when kernel filtering is working and not, but each
client doesn't need to know that.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 3/4] qemu:virtio-net: Add support for qemu_vlan_rxfilter

2009-02-13 Thread Paul Brook
> - A device requests a filter and is told if the request is successful
>   - On success the device may skip it's own filtering
> - If another vlan client is added, the following _must_ occur:
>   - The "filterer" must clear (remove) the filter
>   - The "filteree" must revert to using their own filtering
> - If a vlan client is removed, the following _may_ occur:
>   - vlan clients are notified that they may retry their filter
>..
> The added()/removed() interface feels like the right solution to this.

I think your analysis is fundamentally flawed. Firstly I'm not sure the above 
holds when going between 1 and 2 clients on a vlan.  Even ignoring that, you 
are making implicit assumptions about when a filter will succeed. If these 
assumptions are broken (which seems likely if we ever implement filtering 
with more than 2 devices on a vlan) they you'll get subtle breakage in every 
single user of the API.

> We could use a changed() function, but it would need to know the
> direction of the change, which leads back to the same mechanics.  If
> there's a better way, please suggest it.  Thanks,

I still don't see why the device needs to know what's changed. The response 
should always be the same: Request a filter, and see if it works.

Paul
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qcow2 corruption observed, fixed by reverting old change

2009-02-13 Thread Kevin Wolf
Hi Marc,

You should not take qemu-devel out of the CC list. This is where the
bugs need to be fixed, they aren't KVM specific. I'm quoting your
complete mail to forward it to where it belongs.

Marc Bevand schrieb:
> Jamie Lokier  shareable.org> writes:
>> As you see from the subject, I'm getting qcow2 corruption.
>>
>> I have a Windows 2000 guest which boots and runs fine in kvm-72, fails
>> with a blue-screen indicating file corruption errors in kvm-73 through
>> to kvm-83 (the latest), and succeeds if I replace block-qcow2.c with
>> the version from kvm-72.
>>
>> The blue screen appears towards the end of the boot sequence, and
>> shows only briefly before rebooting.  It says:
>>
>> STOP: c218 (Registry File Failure)
>> The registry cannot load the hive (file):
>> \SystemRoot\System32\Config\SOFTWARE
>> or its log or alternate.
>> It is corrupt, absent, or not writable.
>>
>> Beginning dump of physical memory
>> Physical memory dump complete. Contact your system administrator or
>> technical support [...?]
> 
> I have got a massive KVM installation with hundreds of guests runnings dozens 
> of
> different OSes, and have also noticed multiple qcow2 corruption bugs. All my
> guests are using the qcow2 format, and my hosts are running vanilla linux 
> 2.6.28
> x86_64 kernels and use NPT (Opteron 'Barcelona' 23xx processors).
> 
> My Windows 2000 guests BSOD just like yours with kvm-73 or newer. I have to 
> run
> kvm-75 (I need the NPT fixes it contains) with block-qcow2.c reverted to the
> version from kvm-72 to fix the BSOD.
> 
> kvm-73+ also causes some of my Windows 2003 guests to exhibit this exact
> registry corruption error:
> http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2001452&group_id=180599
> This bug is also fixed by reverting block-qcow2.c to the version from kvm-72.
> 
> I tested kvm-81 and kvm-83 as well (can't test kvm-80 or older because of the
> qcow2 performance regression caused by the default writethrough caching 
> policy)
> but it randomly triggers an even worse bug: the moment I shut down a guest by
> typing "quit" in the monitor, it sometimes overwrite the first 4kB of the disk
> image with mostly NUL bytes (!) which completely destroys it. I am familiar 
> with
> the qcow2 format and apparently this 4kB block seems to be an L2 table with 
> most
> entries set to zero. I have had to restore at least 6 or 7 disk images from
> backup after occurences of that bug. My intuition tells me this may be the 
> qcow2
> code trying to allocate a cluster to write a new L2 table, but not noticing 
> the
> allocation failed (represented by a 0 offset), and writing the L2 table at 
> that
> 0 offset, overwriting the qcow2 header.
> 
> Fortunately this bug is also fixed by running kvm-75 with block-qcow2.c 
> reverted
> to its kvm-72 version.
> 
> Basically qcow2 in kvm-73 or newer is completely unreliable.
> 
> -marc

I think the corruption is a completely unrelated bug. I would suspect it
was introduced in one of Gleb's patches in December. Adding him to CC.

Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/7] [V3] kvm: qemu: deassign device from guest

2009-02-13 Thread Han, Weidong
free_assigned_device just frees device from qemu, it should also
deassign the device from guest when guest exits or hot remove
a device.

Acked-by: Mark McLoughlin 
Signed-off-by: Weidong Han 
---
 qemu/hw/device-assignment.c |   28 ++--
 qemu/hw/device-assignment.h |1 +
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index 5341ef2..fc89c6f 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -549,6 +549,28 @@ static int assign_irq(AssignedDevInfo *adev)
 return r;
 }
 
+static void deassign_device(AssignedDevInfo *adev)
+{
+struct kvm_assigned_pci_dev assigned_dev_data;
+AssignedDevice *dev = adev->assigned_dev;
+int r;
+
+memset(&assigned_dev_data, 0, sizeof(assigned_dev_data));
+assigned_dev_data.assigned_dev_id  =
+   calc_assigned_dev_id(dev->h_busnr, dev->h_devfn);
+
+r = kvm_deassign_pci_device(kvm_context, &assigned_dev_data);
+if (r < 0)
+   fprintf(stderr, "Failed to deassign device \"%s\" : %s\n",
+adev->name, strerror(-r));
+}
+
+void remove_assigned_device(AssignedDevInfo *adev)
+{
+deassign_device(adev);
+free_assigned_device(adev);
+}
+
 /* The pci config space got updated. Check if irq numbers have changed
  * for our devices
  */
@@ -563,7 +585,7 @@ void assigned_dev_update_irqs()
 
 r = assign_irq(adev);
 if (r < 0)
-free_assigned_device(adev);
+remove_assigned_device(adev);
 
 adev = next;
 }
@@ -615,10 +637,12 @@ struct PCIDevice *init_assigned_device(AssignedDevInfo 
*adev, PCIBus *bus)
 /* assign device to guest */
 r = assign_device(adev);
 if (r < 0)
-goto out;
+goto assigned_out;
 
 return &dev->dev;
 
+assigned_out:
+deassign_device(adev);
 out:
 free_assigned_device(adev);
 return NULL;
diff --git a/qemu/hw/device-assignment.h b/qemu/hw/device-assignment.h
index 6a9b9fa..84f3f32 100644
--- a/qemu/hw/device-assignment.h
+++ b/qemu/hw/device-assignment.h
@@ -97,6 +97,7 @@ struct AssignedDevInfo {
 PCIDevice *init_assigned_device(AssignedDevInfo *adev, PCIBus *bus);
 AssignedDevInfo *add_assigned_device(const char *arg);
 void add_assigned_devices(PCIBus *bus, const char **devices, int n_devices);
+void remove_assigned_device(AssignedDevInfo *adev);
 ram_addr_t assigned_dev_load_option_roms(ram_addr_t rom_base_offset);
 void assigned_dev_update_irqs(void);
 
-- 
1.6.0.4


0006-kvm-qemu-deassign-device-from-guest.patch
Description: 0006-kvm-qemu-deassign-device-from-guest.patch


[PATCH 7/7] [V3] kvm: qemu: fix hot remove assigned device with iommu

2009-02-13 Thread Han, Weidong
when hot remove the assigned device with iommu, it should
deassign it from guest and free it from qemu.

assign_dev_update_irqs may not be invoked when hot add a device,
so need to assign irq after device assignment in
init_assigned_device.

Acked-by: Mark McLoughlin 
Signed-off-by: Weidong Han 
---
 qemu/hw/device-assignment.c |   20 
 qemu/hw/device-assignment.h |1 +
 qemu/hw/device-hotplug.c|   17 +
 3 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index fc89c6f..d6acc67 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -571,6 +571,21 @@ void remove_assigned_device(AssignedDevInfo *adev)
 free_assigned_device(adev);
 }
 
+AssignedDevInfo *get_assigned_device(int pcibus, int slot)
+{
+AssignedDevice *assigned_dev = NULL;
+AssignedDevInfo *adev = NULL;
+
+LIST_FOREACH(adev, &adev_head, next) {
+assigned_dev = adev->assigned_dev;
+if (pci_bus_num(assigned_dev->dev.bus) == pcibus &&
+PCI_SLOT(assigned_dev->dev.devfn) == slot)
+return adev;
+}
+
+return NULL;
+}
+
 /* The pci config space got updated. Check if irq numbers have changed
  * for our devices
  */
@@ -639,6 +654,11 @@ struct PCIDevice *init_assigned_device(AssignedDevInfo 
*adev, PCIBus *bus)
 if (r < 0)
 goto assigned_out;
 
+/* assign irq for the device */
+r = assign_irq(adev);
+if (r < 0)
+goto assigned_out;
+ 
 return &dev->dev;
 
 assigned_out:
diff --git a/qemu/hw/device-assignment.h b/qemu/hw/device-assignment.h
index 84f3f32..da775d7 100644
--- a/qemu/hw/device-assignment.h
+++ b/qemu/hw/device-assignment.h
@@ -98,6 +98,7 @@ PCIDevice *init_assigned_device(AssignedDevInfo *adev, PCIBus 
*bus);
 AssignedDevInfo *add_assigned_device(const char *arg);
 void add_assigned_devices(PCIBus *bus, const char **devices, int n_devices);
 void remove_assigned_device(AssignedDevInfo *adev);
+AssignedDevInfo *get_assigned_device(int pcibus, int slot);
 ram_addr_t assigned_dev_load_option_roms(ram_addr_t rom_base_offset);
 void assigned_dev_update_irqs(void);
 
diff --git a/qemu/hw/device-hotplug.c b/qemu/hw/device-hotplug.c
index 671acb2..03987ab 100644
--- a/qemu/hw/device-hotplug.c
+++ b/qemu/hw/device-hotplug.c
@@ -63,6 +63,14 @@ static PCIDevice *qemu_system_hot_assign_device(const char 
*opts, int bus_nr)
 return ret;
 }
 
+static void qemu_system_hot_deassign_device(AssignedDevInfo *adev)
+{
+remove_assigned_device(adev);
+
+term_printf("Unregister host PCI device %02x:%02x.%1x "
+   "(\"%s\") from guest\n", 
+   adev->bus, adev->dev, adev->func, adev->name);
+}
 #endif /* USE_KVM_DEVICE_ASSIGNMENT */
 
 static int add_init_drive(const char *opts)
@@ -240,12 +248,21 @@ void device_hot_remove_success(int pcibus, int slot)
 {
 PCIDevice *d = pci_find_device(pcibus, slot);
 int class_code;
+AssignedDevInfo *adev;
 
 if (!d) {
 term_printf("invalid slot %d\n", slot);
 return;
 }
 
+#ifdef USE_KVM_DEVICE_ASSIGNMENT
+adev = get_assigned_device(pcibus, slot);
+if (adev) {
+qemu_system_hot_deassign_device(adev);
+return;
+}
+#endif /* USE_KVM_DEVICE_ASSIGNMENT */
+
 class_code = d->config_read(d, PCI_CLASS_DEVICE+1, 1);
 
 pci_unregister_device(d);
-- 
1.6.0.4


0007-kvm-qemu-fix-hot-remove-assigned-device-with-iommu.patch
Description: 0007-kvm-qemu-fix-hot-remove-assigned-device-with-iommu.patch


[PATCH 5/7] [V3] kvm: qemu: wrap assign_device and assign_irq

2009-02-13 Thread Han, Weidong
just wrap assign_device and assign_irq, no functional changes

Acked-by: Mark McLoughlin 
Signed-off-by: Weidong Han 
---
 qemu/hw/device-assignment.c |  121 +--
 1 files changed, 70 insertions(+), 51 deletions(-)

diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index 6046fdd..5341ef2 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -487,6 +487,68 @@ static uint32_t calc_assigned_dev_id(uint8_t bus, uint8_t 
devfn)
 return (uint32_t)bus << 8 | (uint32_t)devfn;
 }
 
+static int assign_device(AssignedDevInfo *adev)
+{
+struct kvm_assigned_pci_dev assigned_dev_data;
+AssignedDevice *dev = adev->assigned_dev;
+int r;
+
+memset(&assigned_dev_data, 0, sizeof(assigned_dev_data));
+assigned_dev_data.assigned_dev_id  =
+   calc_assigned_dev_id(dev->h_busnr, dev->h_devfn);
+assigned_dev_data.busnr = dev->h_busnr;
+assigned_dev_data.devfn = dev->h_devfn;
+
+#ifdef KVM_CAP_IOMMU
+/* We always enable the IOMMU if present
+ * (or when not disabled on the command line)
+ */
+r = kvm_check_extension(kvm_context, KVM_CAP_IOMMU);
+if (r && !adev->disable_iommu)
+   assigned_dev_data.flags |= KVM_DEV_ASSIGN_ENABLE_IOMMU;
+#endif
+ 
+r = kvm_assign_pci_device(kvm_context, &assigned_dev_data);
+if (r < 0)
+   fprintf(stderr, "Failed to assign device \"%s\" : %s\n",
+adev->name, strerror(-r));
+return r;
+}
+
+static int assign_irq(AssignedDevInfo *adev)
+{
+struct kvm_assigned_irq assigned_irq_data;
+AssignedDevice *dev = adev->assigned_dev;
+int irq, r = 0;
+
+irq = pci_map_irq(&dev->dev, dev->intpin);
+irq = piix_get_irq(irq);
+
+#ifdef TARGET_IA64
+irq = ipf_map_irq(&dev->dev, irq);
+#endif
+
+if (dev->girq == irq)
+return r;
+
+memset(&assigned_irq_data, 0, sizeof(assigned_irq_data));
+assigned_irq_data.assigned_dev_id =
+calc_assigned_dev_id(dev->h_busnr, dev->h_devfn);
+assigned_irq_data.guest_irq = irq;
+assigned_irq_data.host_irq = dev->real_device.irq;
+r = kvm_assign_irq(kvm_context, &assigned_irq_data);
+if (r < 0) {
+fprintf(stderr, "Failed to assign irq for \"%s\": %s\n",
+adev->name, strerror(-r));
+fprintf(stderr, "Perhaps you are assigning a device "
+"that shares an IRQ with another device?\n");
+return r;
+}
+
+dev->girq = irq;
+return r;
+}
+
 /* The pci config space got updated. Check if irq numbers have changed
  * for our devices
  */
@@ -497,37 +559,11 @@ void assigned_dev_update_irqs()
 adev = LIST_FIRST(&adev_head);
 while (adev) {
 AssignedDevInfo *next = LIST_NEXT(adev, next);
-AssignedDevice *assigned_dev = adev->assigned_dev;
-int irq, r;
-
-irq = pci_map_irq(&assigned_dev->dev, assigned_dev->intpin);
-irq = piix_get_irq(irq);
-
-#ifdef TARGET_IA64
-   irq = ipf_map_irq(&assigned_dev->dev, irq);
-#endif
+int r;
 
-if (irq != assigned_dev->girq) {
-struct kvm_assigned_irq assigned_irq_data;
-
-memset(&assigned_irq_data, 0, sizeof(assigned_irq_data));
-assigned_irq_data.assigned_dev_id  =
-calc_assigned_dev_id(assigned_dev->h_busnr,
- (uint8_t) assigned_dev->h_devfn);
-assigned_irq_data.guest_irq = irq;
-assigned_irq_data.host_irq = assigned_dev->real_device.irq;
-r = kvm_assign_irq(kvm_context, &assigned_irq_data);
-if (r < 0) {
-fprintf(stderr, "Failed to assign irq for \"%s\": %s\n",
-adev->name, strerror(-r));
-fprintf(stderr, "Perhaps you are assigning a device "
-"that shares an IRQ with another device?\n");
-free_assigned_device(adev);
-adev = next;
-continue;
-}
-assigned_dev->girq = irq;
-}
+r = assign_irq(adev);
+if (r < 0)
+free_assigned_device(adev);
 
 adev = next;
 }
@@ -576,27 +612,10 @@ struct PCIDevice *init_assigned_device(AssignedDevInfo 
*adev, PCIBus *bus)
 dev->h_busnr = adev->bus;
 dev->h_devfn = PCI_DEVFN(adev->dev, adev->func);
 
-memset(&assigned_dev_data, 0, sizeof(assigned_dev_data));
-assigned_dev_data.assigned_dev_id  =
-   calc_assigned_dev_id(dev->h_busnr, (uint32_t)dev->h_devfn);
-assigned_dev_data.busnr = dev->h_busnr;
-assigned_dev_data.devfn = dev->h_devfn;
-
-#ifdef KVM_CAP_IOMMU
-/* We always enable the IOMMU if present
- * (or when not disabled on the command line)
- */
-r = kvm_check_extension(kvm_context, KVM_CAP_IOMMU);
-if (r && !adev->disable_iommu)
-   assigned_dev_data.flags |= KVM_DEV_ASSIGN_ENABLE_IOMMU;
-#endif
-
-r = kvm_assign_pci_device(kvm_context, &assigned_dev_data);
-if (r < 0

[PATCH 4/7] [V3] kvm: qemu: free device on error in init_assigned_device

2009-02-13 Thread Han, Weidong
make init_assigned_device call free_assigned_device on error,
and then make free_assigned_device is static because it's only
invoked in device-assigned.c.

Acked-by: Mark McLoughlin 
Signed-off-by: Weidong Han 
---
 qemu/hw/device-assignment.c |   12 
 qemu/hw/device-assignment.h |1 -
 qemu/hw/device-hotplug.c|1 -
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index e6d2352..6046fdd 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -443,7 +443,7 @@ again:
 
 static LIST_HEAD(, AssignedDevInfo) adev_head;
 
-void free_assigned_device(AssignedDevInfo *adev)
+static void free_assigned_device(AssignedDevInfo *adev)
 {
 AssignedDevice *dev = adev->assigned_dev;
 
@@ -558,14 +558,14 @@ struct PCIDevice *init_assigned_device(AssignedDevInfo 
*adev, PCIBus *bus)
 if (get_real_device(dev, adev->bus, adev->dev, adev->func)) {
 fprintf(stderr, "%s: Error: Couldn't get real device (%s)!\n",
 __func__, adev->name);
-return NULL;
+goto out;
 }
 
 /* handle real device's MMIO/PIO BARs */
 if (assigned_dev_register_regions(dev->real_device.regions,
   dev->real_device.region_number,
   dev))
-return NULL;
+goto out;
 
 /* handle interrupt routing */
 e_device = (dev->dev.devfn >> 3) & 0x1f;
@@ -595,10 +595,14 @@ struct PCIDevice *init_assigned_device(AssignedDevInfo 
*adev, PCIBus *bus)
 if (r < 0) {
fprintf(stderr, "Failed to assign device \"%s\" : %s\n",
 adev->name, strerror(-r));
-   return NULL;
+   goto out;
 }
 
 return &dev->dev;
+
+out:
+free_assigned_device(adev);
+return NULL;
 }
 
 /*
diff --git a/qemu/hw/device-assignment.h b/qemu/hw/device-assignment.h
index f216bb0..6a9b9fa 100644
--- a/qemu/hw/device-assignment.h
+++ b/qemu/hw/device-assignment.h
@@ -94,7 +94,6 @@ struct AssignedDevInfo {
 int disable_iommu;
 };
 
-void free_assigned_device(AssignedDevInfo *adev);
 PCIDevice *init_assigned_device(AssignedDevInfo *adev, PCIBus *bus);
 AssignedDevInfo *add_assigned_device(const char *arg);
 void add_assigned_devices(PCIBus *bus, const char **devices, int n_devices);
diff --git a/qemu/hw/device-hotplug.c b/qemu/hw/device-hotplug.c
index d8f0fcc..671acb2 100644
--- a/qemu/hw/device-hotplug.c
+++ b/qemu/hw/device-hotplug.c
@@ -51,7 +51,6 @@ static PCIDevice *qemu_system_hot_assign_device(const char 
*opts, int bus_nr)
 ret = init_assigned_device(adev, pci_bus);
 if (ret == NULL) {
 term_printf("Failed to assign device\n");
-free_assigned_device(adev);
 return NULL;
 }
 
-- 
1.6.0.4


0004-kvm-qemu-free-device-on-error-in-init_assigned_dev.patch
Description: 0004-kvm-qemu-free-device-on-error-in-init_assigned_dev.patch


[PATCH 3/7] [V3] kvm: libkvm: add deassign ioctl

2009-02-13 Thread Han, Weidong
add this to support hot remove device with iommu

Acked-by: Mark McLoughlin 
Signed-off-by: Weidong Han 
---
 libkvm/libkvm.c |   14 ++
 libkvm/libkvm.h |   13 +
 2 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/libkvm/libkvm.c b/libkvm/libkvm.c
index 92ffe10..a559a0d 100644
--- a/libkvm/libkvm.c
+++ b/libkvm/libkvm.c
@@ -1141,6 +1141,20 @@ int kvm_assign_irq(kvm_context_t kvm,
 }
 #endif
 
+#ifdef KVM_CAP_DEVICE_DEASSIGNMENT
+int kvm_deassign_pci_device(kvm_context_t kvm,
+   struct kvm_assigned_pci_dev *assigned_dev)
+{
+   int ret;
+
+   ret = ioctl(kvm->vm_fd, KVM_DEASSIGN_PCI_DEVICE, assigned_dev);
+   if (ret < 0)
+   return -errno;
+
+   return ret;
+}
+#endif
+
 int kvm_destroy_memory_region_works(kvm_context_t kvm)
 {
int ret = 0;
diff --git a/libkvm/libkvm.h b/libkvm/libkvm.h
index e79e4fd..f6d653c 100644
--- a/libkvm/libkvm.h
+++ b/libkvm/libkvm.h
@@ -737,6 +737,19 @@ int kvm_assign_irq(kvm_context_t kvm,
 int kvm_destroy_memory_region_works(kvm_context_t kvm);
 #endif
 
+#ifdef KVM_CAP_DEVICE_DEASSIGNMENT
+/*!
+ * \brief Notifies host kernel about a PCI device to be deassigned from a guest
+ *
+ * Used for hot remove PCI device, this function notifies the host
+ * kernel about the deassigning of the physical PCI device from a guest.
+ *
+ * \param kvm Pointer to the current kvm_context
+ * \param assigned_dev Parameters, like bus, devfn number, etc
+ */
+int kvm_deassign_pci_device(kvm_context_t kvm,
+   struct kvm_assigned_pci_dev *assigned_dev);
+#endif
 
 /*!
  * \brief Checks whether the generic irq routing capability is present
-- 
1.6.0.4


0003-kvm-libkvm-add-deassign-ioctl.patch
Description: 0003-kvm-libkvm-add-deassign-ioctl.patch


[PATCH 1/7] [V3] kvm: define KVM_CAP_DEVICE_DEASSIGNMENT

2009-02-13 Thread Han, Weidong
define KVM_CAP_DEVICE_DEASSIGNMENT and KVM_DEASSIGN_PCI_DEVICE
for device deassignment.

the ioctl has been already implemented in the
commit: 0a920356748df4fb06e86c21c23d2ed6d31d37ad

Acked-by: Mark McLoughlin 
Signed-off-by: Weidong Han 
---
 include/linux/kvm.h |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 869462c..02dfb1b 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -402,6 +402,9 @@ struct kvm_trace_rec {
 #ifdef __KVM_HAVE_IOAPIC
 #define KVM_CAP_IRQ_ROUTING 25
 #endif
+#ifdef __KVM_HAVE_DEVICE_ASSIGNMENT
+#define KVM_CAP_DEVICE_DEASSIGNMENT 26
+#endif
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -466,6 +469,8 @@ struct kvm_irq_routing {
 #define KVM_ASSIGN_IRQ _IOR(KVMIO, 0x70, \
struct kvm_assigned_irq)
 #define KVM_REINJECT_CONTROL  _IO(KVMIO, 0x71)
+#define KVM_DEASSIGN_PCI_DEVICE _IOR(KVMIO, 0x72, \
+struct kvm_assigned_pci_dev)
 
 /*
  * ioctls for vcpu fds
-- 
1.6.0.4


0001-kvm-define-KVM_CAP_DEVICE_DEASSIGNMENT.patch
Description: 0001-kvm-define-KVM_CAP_DEVICE_DEASSIGNMENT.patch


[PATCH 0/7] [V3] kvm: fix hot remove assigned device with IOMMU

2009-02-13 Thread Han, Weidong
Changelog from v2 -> v3:

only need to set assigned_dev_id for deassignment in userspace, and use 
match->flags to judge and deassign it in kvm ioctl.

Regards,
Weidong--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/7] [V3] kvm: fix kvm_vm_ioctl_deassign_device

2009-02-13 Thread Han, Weidong
only need to set assigned_dev_id for deassignment, use
match->flags to judge and deassign it.

Acked-by: Mark McLoughlin 
Signed-off-by: Weidong Han 
---
 virt/kvm/kvm_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ef52622..3aacdb9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -574,7 +574,7 @@ static int kvm_vm_ioctl_deassign_device(struct kvm *kvm,
goto out;
}
 
-   if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
+   if (match->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
kvm_deassign_device(kvm, match);
 
kvm_free_assigned_device(kvm, match);
-- 
1.6.0.4


0002-kvm-fix-kvm_vm_ioctl_deassign_device.patch
Description: 0002-kvm-fix-kvm_vm_ioctl_deassign_device.patch


RE: [PATCH 5/6] kvm: qemu: deassign device from guest

2009-02-13 Thread Han, Weidong
Mark McLoughlin wrote:
> On Fri, 2009-02-13 at 14:58 +0800, Han, Weidong wrote:
>> +static void deassign_device(AssignedDevInfo *adev) +{
>> +struct kvm_assigned_pci_dev assigned_dev_data;
>> +AssignedDevice *dev = adev->assigned_dev;
>> +int r;
>> +
>> +memset(&assigned_dev_data, 0, sizeof(assigned_dev_data));
>> +assigned_dev_data.assigned_dev_id  =
>> +   calc_assigned_dev_id(dev->h_busnr, dev->h_devfn);
> 
>> From here:
> 
>> +assigned_dev_data.busnr = dev->h_busnr;
>> +assigned_dev_data.devfn = dev->h_devfn;
>> +
>> +#ifdef KVM_CAP_IOMMU
>> +/* We always enable the IOMMU if present
>> + * (or when not disabled on the command line)
>> + */
>> +r = kvm_check_extension(kvm_context, KVM_CAP_IOMMU);
>> +if (r && !adev->disable_iommu)
>> +   assigned_dev_data.flags |= KVM_DEV_ASSIGN_ENABLE_IOMMU;
>> +#endif
> 
> to here.
> 
> As, I said, these lines should not be needed and the ioctl should be
> fixed like so:
> 
> -   if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
> +   if (match->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
> kvm_deassign_device(kvm, match);
> 

I ignored this comment, will update it.

Regards,
Weidong

> Cheers,
> Mark.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/6] kvm: qemu: deassign device from guest

2009-02-13 Thread Mark McLoughlin
On Fri, 2009-02-13 at 14:58 +0800, Han, Weidong wrote:
> +static void deassign_device(AssignedDevInfo *adev)
> +{
> +struct kvm_assigned_pci_dev assigned_dev_data;
> +AssignedDevice *dev = adev->assigned_dev;
> +int r;
> +
> +memset(&assigned_dev_data, 0, sizeof(assigned_dev_data));
> +assigned_dev_data.assigned_dev_id  =
> +   calc_assigned_dev_id(dev->h_busnr, dev->h_devfn);

>From here:

> +assigned_dev_data.busnr = dev->h_busnr;
> +assigned_dev_data.devfn = dev->h_devfn;
> +
> +#ifdef KVM_CAP_IOMMU
> +/* We always enable the IOMMU if present
> + * (or when not disabled on the command line)
> + */
> +r = kvm_check_extension(kvm_context, KVM_CAP_IOMMU);
> +if (r && !adev->disable_iommu)
> +   assigned_dev_data.flags |= KVM_DEV_ASSIGN_ENABLE_IOMMU;
> +#endif

to here.

As, I said, these lines should not be needed and the ioctl should be
fixed like so:

-   if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
+   if (match->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
kvm_deassign_device(kvm, match);

Cheers,
Mark.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html