Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread MORITA Kazutaka
At Fri, 21 May 2010 06:28:42 +0100,
Stefan Hajnoczi wrote:
> 
> On Thu, May 20, 2010 at 11:16 PM, Christian Brunner  wrote:
> > 2010/5/20 Anthony Liguori :
> >> Both sheepdog and ceph ultimately transmit I/O over a socket to a central
> >> daemon, right?  So could we not standardize a protocol for this that both
> >> sheepdog and ceph could implement?
> >
> > There is no central daemon. The concept is that they talk to many
> > storage nodes at the same time. Data is distributed and replicated
> > over many nodes in the network. The mechanism to do this is quite
> > complex. I don't know about sheepdog, but in Ceph this is called RADOS
> > (reliable autonomic distributed object store). Sheepdog and Ceph may
> > look similar, but this is where they act different. I don't think that
> > it would be possible to implement a common protocol.
> 
> I believe Sheepdog has a local daemon on each node.  The QEMU storage
> backend talks to the daemon on the same node, which then does the real
> network communication with the rest of the distributed storage system.

Yes.  It is because Sheepdog doesn't have a configuration about
cluster membership as I mentioned in another mail, so the drvier
doesn't know which node to access other than localhost.

>  So I think we're not talking about a network protocol here, we're
> talking about a common interface that can be used by QEMU and other
> programs to take advantage of Ceph, Sheepdog, etc services available
> on the local node.
> 
> Haven't looked into your patch enough yet, but does librados talk
> directly over the network or does it connect to a local daemon/driver?
> 

AFAIK, librados access directly over the network, so I think it is
difficult to define a common interface.


Thanks,

Kazutaka

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread MORITA Kazutaka
At Fri, 21 May 2010 00:16:46 +0200,
Christian Brunner wrote:
> 
> 2010/5/20 Anthony Liguori :
> >> With new approaches like Sheepdog or Ceph, things are getting a lot
> >> cheaper and you can scale your system without disrupting your service.
> >> The concepts are quite similar to what Amazon is doing in their EC2
> >> environment, but they certainly won't publish it as OpenSource anytime
> >> soon.
> >>
> >> Both projects have advantages and disadvantages. Ceph is a bit more
> >> universal as it implements a whole filesystem. Sheepdog is more
> >> feature complete in regards of managing images (e.g. snapshots). Both

I think a major difference is that Sheepdog servers act fully
autonomously.  Any Sheepdog server has no fixed role such as a monitor
server, and Sheepdog doesn't require any configuration about a list of
nodes in the cluster.


> >> projects require some additional work to become stable, but they are
> >> on a good way.
> >>
> >> I would really like to see both drivers in the qemu tree, as they are
> >> the key to a design shift in how storage in the datacenter is being
> >> built.
> >>
> >
> > I'd be more interested in enabling people to build these types of storage
> > systems without touching qemu.
> 
> You could do this by using Yehuda's rbd kernel driver, but I think
> that it would be better to avoid this additional layer.
> 

I agree.  In addition, if a storage client is a qemu driver, the
storage system can support some features specific to qemu such as live
snapshot from qemu monitor.

Regards,

Kazutaka

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Stefan Hajnoczi
On Thu, May 20, 2010 at 11:16 PM, Christian Brunner  wrote:
> 2010/5/20 Anthony Liguori :
>> Both sheepdog and ceph ultimately transmit I/O over a socket to a central
>> daemon, right?  So could we not standardize a protocol for this that both
>> sheepdog and ceph could implement?
>
> There is no central daemon. The concept is that they talk to many
> storage nodes at the same time. Data is distributed and replicated
> over many nodes in the network. The mechanism to do this is quite
> complex. I don't know about sheepdog, but in Ceph this is called RADOS
> (reliable autonomic distributed object store). Sheepdog and Ceph may
> look similar, but this is where they act different. I don't think that
> it would be possible to implement a common protocol.

I believe Sheepdog has a local daemon on each node.  The QEMU storage
backend talks to the daemon on the same node, which then does the real
network communication with the rest of the distributed storage system.
 So I think we're not talking about a network protocol here, we're
talking about a common interface that can be used by QEMU and other
programs to take advantage of Ceph, Sheepdog, etc services available
on the local node.

Haven't looked into your patch enough yet, but does librados talk
directly over the network or does it connect to a local daemon/driver?

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [KVM-AUTOTEST PATCH] KVM test: use command line option wrapper functions

2010-05-20 Thread Feng Yang

- "Lucas Meneghel Rodrigues"  wrote:

> From: "Lucas Meneghel Rodrigues" 
> To: "Michael Goldish" 
> Cc: "Feng Yang" , autot...@test.kernel.org, 
> kvm@vger.kernel.org
> Sent: Thursday, May 20, 2010 6:57:23 PM GMT +08:00 Beijing / Chongqing / Hong 
> Kong / Urumqi
> Subject: Re: [KVM-AUTOTEST PATCH] KVM test: use command line option wrapper 
> functions
>
> On Thu, 2010-05-20 at 12:50 +0300, Michael Goldish wrote:
> > On 05/19/2010 11:25 AM, Feng Yang wrote:
> > > Hi, Michael
> > > 
> > > Thanks for your patch.
> > > We plan add "netdev" parameter support in make_qemu_command. 
> Since you are working on this part. Could you add netdev support in
> your patch? hopeful netdev can be default supported in
> make_qemu_command if qemu support it. Thanks very much!
> > 
> > Sure, I'll look into it.
> > 
> > > I think the point of this patch is good and we need this kinds of
> patch.
> > > But I think we need not add so many new function.  Especially some
> function only directly return the string and do nothing more.
> > > This will increase the function call consumption.
> > > 
> > All these helper functions are meant to be extended and modified in
> the
> > future.  They're only there to minimize future effort involved in
> adding
> > support for new command line syntaxes.
> > Right now add_smp() just returns " -smp %s", but in the future we
> may
> > have to support different syntaxes for -smp, and then add_smp()
> will
> > consult the output of 'qemu -help' and return the proper string.
> > What do you mean by function call consumption?  I don't think these
> > functions cause a measurable slowdown, and make_qemu_command() is
> called
> > very few times, so this really isn't a concern IMO.
> 
> Agreed, the wrappers are a good strategy in the case we have to
> support
> different feature sets and syntax.

I know your meaning. Yes, the wrapper is a good strategy for some parameters.
For the new added feature which could not work on old kvm, We need check 
support 
before using it.  So i say we need this patch. But i still think so many one 
line function 
is not a good code style.

For many parameters that already support by all the kvm build, I do not think 
it will have big change on syntax later.
Because a mature software should ensure the stability of the interface.
So we need not add one line function for these parameters now.
Even these parameters change its interface later, then we update the code is ok.


Again. Thanks for your patch and comments.


> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH qemu-kvm 2/2] device-assignment: Don't use libpci

2010-05-20 Thread Alex Williamson
On Thu, 2010-05-20 at 17:27 -0700, Chris Wright wrote:
> From: Alex Williamson 
> 
> We've already got an open fd for PCI config space for the device, we
> might as well use it.  This also makes sure that if we're making use of
> a privileged file descriptor opened for us, we use it for all accesses
> to the device.
> 
> Signed-off-by: Alex Williamson 
> [chrisw: kill pci_dev, configure check for header, narrow header requirement]
> Signed-off-by: Chris Wright 

Looks good, Thanks

Alex

> ---
>  configure  |   14 +-
>  hw/device-assignment.c |   68 
> 
>  hw/device-assignment.h |1 -
>  3 files changed, 58 insertions(+), 25 deletions(-)
> 
> diff --git a/configure b/configure
> index ed8e17b..8ca9e1e 100755
> --- a/configure
> +++ b/configure
> @@ -1623,20 +1623,20 @@ EOF
>  fi
>  
>  ##
> -# libpci probe for kvm_cap_device_assignment
> +# libpci header probe for kvm_cap_device_assignment
>  if test $kvm_cap_device_assignment = "yes" ; then
>cat > $TMPC << EOF
> -#include 
> +#include 
>  #ifndef PCI_VENDOR_ID
> -#error NO LIBPCI
> +#error NO LIBPCI HEADER
>  #endif
> -int main(void) { struct pci_access a; pci_init(&a); return 0; }
> +int main(void) { return 0; }
>  EOF
> -  if compile_prog "" "-lpci -lz" ; then
> -libs_softmmu="-lpci -lz $libs_softmmu"
> +  if compile_prog "" "" ; then
> +kvm_cap_device_assignment=yes
>else
>  echo
> -echo "Error: libpci check failed"
> +echo "Error: libpci header check failed"
>  echo "Disable KVM Device Assignment capability."
>  echo
>  kvm_cap_device_assignment=no
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index fd09ec3..d8e7cb4 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -37,7 +37,7 @@
>  #include "console.h"
>  #include "device-assignment.h"
>  #include "loader.h"
> -#include 
> +#include 
>  
>  /* From linux/ioport.h */
>  #define IORESOURCE_IO   0x0100  /* Resource type */
> @@ -335,24 +335,61 @@ static void assigned_dev_ioport_map(PCIDevice *pci_dev, 
> int region_num,
>(r_dev->v_addrs + region_num));
>  }
>  
> -static uint8_t pci_find_cap_offset(struct pci_dev *pci_dev, uint8_t cap)
> +static uint32_t assigned_dev_pci_read(PCIDevice *d, int pos, int len)
> +{
> +AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
> +uint32_t val;
> +ssize_t ret;
> +int fd = pci_dev->real_device.config_fd;
> +
> +again:
> +ret = pread(fd, &val, len, pos);
> +if (ret != len) {
> + if ((ret < 0) && (errno == EINTR || errno == EAGAIN))
> + goto again;
> +
> + fprintf(stderr, "%s: pread failed, ret = %zd errno = %d\n",
> + __func__, ret, errno);
> +
> + exit(1);
> +}
> +
> +return val;
> +}
> +
> +static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos)
> +{
> +return (uint8_t)assigned_dev_pci_read(d, pos, 1);
> +}
> +
> +static uint16_t assigned_dev_pci_read_word(PCIDevice *d, int pos)
> +{
> +return (uint16_t)assigned_dev_pci_read(d, pos, 2);
> +}
> +
> +static uint32_t assigned_dev_pci_read_long(PCIDevice *d, int pos)
> +{
> +return assigned_dev_pci_read(d, pos, 4);
> +}
> +
> +static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap)
>  {
>  int id;
>  int max_cap = 48;
>  int pos = PCI_CAPABILITY_LIST;
>  int status;
>  
> -status = pci_read_byte(pci_dev, PCI_STATUS);
> +status = assigned_dev_pci_read_byte(d, PCI_STATUS);
>  if ((status & PCI_STATUS_CAP_LIST) == 0)
>  return 0;
>  
>  while (max_cap--) {
> -pos = pci_read_byte(pci_dev, pos);
> +pos = assigned_dev_pci_read_byte(d, pos);
>  if (pos < 0x40)
>  break;
>  
>  pos &= ~3;
> -id = pci_read_byte(pci_dev, pos + PCI_CAP_LIST_ID);
> +id = assigned_dev_pci_read_byte(d, pos + PCI_CAP_LIST_ID);
>  
>  if (id == 0xff)
>  break;
> @@ -858,7 +895,7 @@ static int assign_irq(AssignedDevice *dev)
>  int irq, r = 0;
>  
>  /* Interrupt PIN 0 means don't use INTx */
> -if (pci_read_byte(dev->pdev, PCI_INTERRUPT_PIN) == 0)
> +if (assigned_dev_pci_read_byte(&dev->dev, PCI_INTERRUPT_PIN) == 0)
>  return 0;
>  
>  irq = pci_map_irq(&dev->dev, dev->intpin);
> @@ -1196,7 +1233,7 @@ static int assigned_device_pci_cap_init(PCIDevice 
> *pci_dev)
>  #ifdef KVM_CAP_DEVICE_MSI
>  /* Expose MSI capability
>   * MSI capability is the 1st capability in capability config */
> -if (pci_find_cap_offset(dev->pdev, PCI_CAP_ID_MSI)) {
> +if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI)) {
>  dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI;
>  memset(&pci_dev->config[pci_dev->cap.start + pci_dev->cap.length],
> 0, PCI_CAPABILITY_CONFIG_MSI_LENGTH);
> @@ -1208,23 +1245,25 @@ static int assigned_device_pci_cap_init(PCIDevice 
> *pc

Re: [PATCH qemu-kvm 1/2] device-assignment: use stdint types

2010-05-20 Thread Alex Williamson
On Thu, 2010-05-20 at 17:25 -0700, Chris Wright wrote:
> Use stdint types to avoid extra reliance on pci/pci.h header.
> 
> Cc: Alex Williamson 
> Signed-off-by: Chris Wright 
> ---
>  hw/device-assignment.c |6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)

Acked-by: Alex Williamson 

> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index eb31c78..fd09ec3 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -1024,9 +1024,9 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, 
> unsigned int ctrl_pos)
>  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
>  {
>  AssignedDevice *adev = container_of(pci_dev, AssignedDevice, dev);
> -u16 entries_nr = 0, entries_max_nr;
> +uint16_t entries_nr = 0, entries_max_nr;
>  int pos = 0, i, r = 0;
> -u32 msg_addr, msg_upper_addr, msg_data, msg_ctrl;
> +uint32_t msg_addr, msg_upper_addr, msg_data, msg_ctrl;
>  struct kvm_assigned_msix_nr msix_nr;
>  struct kvm_assigned_msix_entry msix_entry;
>  void *va = adev->msix_table_page;
> @@ -1210,7 +1210,7 @@ static int assigned_device_pci_cap_init(PCIDevice 
> *pci_dev)
>  /* Expose MSI-X capability */
>  if (pci_find_cap_offset(dev->pdev, PCI_CAP_ID_MSIX)) {
>  int pos, entry_nr, bar_nr;
> -u32 msix_table_entry;
> +uint32_t msix_table_entry;
>  dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX;
>  memset(&pci_dev->config[pci_dev->cap.start + pci_dev->cap.length],
> 0, PCI_CAPABILITY_CONFIG_MSIX_LENGTH);



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH qemu-kvm 2/2] device-assignment: Don't use libpci

2010-05-20 Thread Chris Wright
From: Alex Williamson 

We've already got an open fd for PCI config space for the device, we
might as well use it.  This also makes sure that if we're making use of
a privileged file descriptor opened for us, we use it for all accesses
to the device.

Signed-off-by: Alex Williamson 
[chrisw: kill pci_dev, configure check for header, narrow header requirement]
Signed-off-by: Chris Wright 
---
 configure  |   14 +-
 hw/device-assignment.c |   68 
 hw/device-assignment.h |1 -
 3 files changed, 58 insertions(+), 25 deletions(-)

diff --git a/configure b/configure
index ed8e17b..8ca9e1e 100755
--- a/configure
+++ b/configure
@@ -1623,20 +1623,20 @@ EOF
 fi
 
 ##
-# libpci probe for kvm_cap_device_assignment
+# libpci header probe for kvm_cap_device_assignment
 if test $kvm_cap_device_assignment = "yes" ; then
   cat > $TMPC << EOF
-#include 
+#include 
 #ifndef PCI_VENDOR_ID
-#error NO LIBPCI
+#error NO LIBPCI HEADER
 #endif
-int main(void) { struct pci_access a; pci_init(&a); return 0; }
+int main(void) { return 0; }
 EOF
-  if compile_prog "" "-lpci -lz" ; then
-libs_softmmu="-lpci -lz $libs_softmmu"
+  if compile_prog "" "" ; then
+kvm_cap_device_assignment=yes
   else
 echo
-echo "Error: libpci check failed"
+echo "Error: libpci header check failed"
 echo "Disable KVM Device Assignment capability."
 echo
 kvm_cap_device_assignment=no
diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index fd09ec3..d8e7cb4 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -37,7 +37,7 @@
 #include "console.h"
 #include "device-assignment.h"
 #include "loader.h"
-#include 
+#include 
 
 /* From linux/ioport.h */
 #define IORESOURCE_IO   0x0100  /* Resource type */
@@ -335,24 +335,61 @@ static void assigned_dev_ioport_map(PCIDevice *pci_dev, 
int region_num,
   (r_dev->v_addrs + region_num));
 }
 
-static uint8_t pci_find_cap_offset(struct pci_dev *pci_dev, uint8_t cap)
+static uint32_t assigned_dev_pci_read(PCIDevice *d, int pos, int len)
+{
+AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
+uint32_t val;
+ssize_t ret;
+int fd = pci_dev->real_device.config_fd;
+
+again:
+ret = pread(fd, &val, len, pos);
+if (ret != len) {
+   if ((ret < 0) && (errno == EINTR || errno == EAGAIN))
+   goto again;
+
+   fprintf(stderr, "%s: pread failed, ret = %zd errno = %d\n",
+   __func__, ret, errno);
+
+   exit(1);
+}
+
+return val;
+}
+
+static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos)
+{
+return (uint8_t)assigned_dev_pci_read(d, pos, 1);
+}
+
+static uint16_t assigned_dev_pci_read_word(PCIDevice *d, int pos)
+{
+return (uint16_t)assigned_dev_pci_read(d, pos, 2);
+}
+
+static uint32_t assigned_dev_pci_read_long(PCIDevice *d, int pos)
+{
+return assigned_dev_pci_read(d, pos, 4);
+}
+
+static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap)
 {
 int id;
 int max_cap = 48;
 int pos = PCI_CAPABILITY_LIST;
 int status;
 
-status = pci_read_byte(pci_dev, PCI_STATUS);
+status = assigned_dev_pci_read_byte(d, PCI_STATUS);
 if ((status & PCI_STATUS_CAP_LIST) == 0)
 return 0;
 
 while (max_cap--) {
-pos = pci_read_byte(pci_dev, pos);
+pos = assigned_dev_pci_read_byte(d, pos);
 if (pos < 0x40)
 break;
 
 pos &= ~3;
-id = pci_read_byte(pci_dev, pos + PCI_CAP_LIST_ID);
+id = assigned_dev_pci_read_byte(d, pos + PCI_CAP_LIST_ID);
 
 if (id == 0xff)
 break;
@@ -858,7 +895,7 @@ static int assign_irq(AssignedDevice *dev)
 int irq, r = 0;
 
 /* Interrupt PIN 0 means don't use INTx */
-if (pci_read_byte(dev->pdev, PCI_INTERRUPT_PIN) == 0)
+if (assigned_dev_pci_read_byte(&dev->dev, PCI_INTERRUPT_PIN) == 0)
 return 0;
 
 irq = pci_map_irq(&dev->dev, dev->intpin);
@@ -1196,7 +1233,7 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
 #ifdef KVM_CAP_DEVICE_MSI
 /* Expose MSI capability
  * MSI capability is the 1st capability in capability config */
-if (pci_find_cap_offset(dev->pdev, PCI_CAP_ID_MSI)) {
+if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI)) {
 dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI;
 memset(&pci_dev->config[pci_dev->cap.start + pci_dev->cap.length],
0, PCI_CAPABILITY_CONFIG_MSI_LENGTH);
@@ -1208,23 +1245,25 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
 #endif
 #ifdef KVM_CAP_DEVICE_MSIX
 /* Expose MSI-X capability */
-if (pci_find_cap_offset(dev->pdev, PCI_CAP_ID_MSIX)) {
+if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX)) {
 int pos, entry_nr, bar_nr;
 uint32_t msix_table_entry;
 dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX;
 memset(&pci_dev->config[pci_dev->cap.st

[PATCH qemu-kvm 1/2] device-assignment: use stdint types

2010-05-20 Thread Chris Wright
Use stdint types to avoid extra reliance on pci/pci.h header.

Cc: Alex Williamson 
Signed-off-by: Chris Wright 
---
 hw/device-assignment.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index eb31c78..fd09ec3 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1024,9 +1024,9 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, 
unsigned int ctrl_pos)
 static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
 {
 AssignedDevice *adev = container_of(pci_dev, AssignedDevice, dev);
-u16 entries_nr = 0, entries_max_nr;
+uint16_t entries_nr = 0, entries_max_nr;
 int pos = 0, i, r = 0;
-u32 msg_addr, msg_upper_addr, msg_data, msg_ctrl;
+uint32_t msg_addr, msg_upper_addr, msg_data, msg_ctrl;
 struct kvm_assigned_msix_nr msix_nr;
 struct kvm_assigned_msix_entry msix_entry;
 void *va = adev->msix_table_page;
@@ -1210,7 +1210,7 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
 /* Expose MSI-X capability */
 if (pci_find_cap_offset(dev->pdev, PCI_CAP_ID_MSIX)) {
 int pos, entry_nr, bar_nr;
-u32 msix_table_entry;
+uint32_t msix_table_entry;
 dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX;
 memset(&pci_dev->config[pci_dev->cap.start + pci_dev->cap.length],
0, PCI_CAPABILITY_CONFIG_MSIX_LENGTH);
-- 
1.6.6.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] device-assignment: Don't use libpci

2010-05-20 Thread Chris Wright
* Chris Wright (chr...@redhat.com) wrote:
> * Alex Williamson (alex.william...@redhat.com) wrote:
> > We've already got an open fd for PCI config space for the device, we
> > might as well use it.  This also makes sure that if we're making use of
> > a privileged file descriptor opened for us, we use it for all accesses
> > to the device.
> > 
> > Signed-off-by: Alex Williamson 
> 
> Looks good to me.

Hmm, it's not that straight forward.  We still have a build requirement
for libpci.  Not for the library, but the headers.  This leaves the
check for the header, drops the unused pci_dev, and narrows our header
requirement.

thanks,
-chris
> Acked-by: Chris Wright 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Yehuda Sadeh Weinraub
On Thu, May 20, 2010 at 1:31 PM, Blue Swirl  wrote:
> On Wed, May 19, 2010 at 7:22 PM, Christian Brunner  wrote:
>> The attached patch is a block driver for the distributed file system
>> Ceph (http://ceph.newdream.net/). This driver uses librados (which
>> is part of the Ceph server) for direct access to the Ceph object
>> store and is running entirely in userspace. Therefore it is
>> called "rbd" - rados block device.
...
>
> IIRC underscores here may conflict with system header use. Please use
> something like QEMU_BLOCK_RADOS_H.

This header is shared between the linux kernel client and the ceph
userspace servers and client. We can actually get rid of it, as we
only need it to define CEPH_OSD_TMAP_SET. We can move this definition
to librados.h.

>> diff --git a/block/rbd_types.h b/block/rbd_types.h
>> new file mode 100644
>> index 000..dfd5aa0
>> --- /dev/null
>> +++ b/block/rbd_types.h
>> @@ -0,0 +1,48 @@
>> +#ifndef _FS_CEPH_RBD
>> +#define _FS_CEPH_RBD
>
> QEMU_BLOCK_RBD?

This header is shared between the ceph kernel client, between the qemu
rbd module (and between other ceph utilities). It'd be much easier
maintaining it without having to have a different implementation for
each. The same goes to the use of __le32/64 and __u32/64 within these
headers.

>
>> +
>> +#include 
>
> Can you use standard includes, like  or ? Are
> Ceph libraries used in other systems than Linux?

Not at the moment. I guess that we can take this include out.

>
>> +
>> +/*
>> + * rbd image 'foo' consists of objects
>> + *   foo.rbd      - image metadata
>> + *   foo.
>> + *   foo.0001
>> + *   ...          - data
>> + */
>> +
>> +#define RBD_SUFFIX             ".rbd"
>> +#define RBD_DIRECTORY           "rbd_directory"
>> +
>> +#define RBD_DEFAULT_OBJ_ORDER  22   /* 4MB */
>> +
>> +#define RBD_MAX_OBJ_NAME_SIZE  96
>> +#define RBD_MAX_SEG_NAME_SIZE  128
>> +
>> +#define RBD_COMP_NONE          0
>> +#define RBD_CRYPT_NONE         0
>> +
>> +static const char rbd_text[] = "<<< Rados Block Device Image >>>\n";
>> +static const char rbd_signature[] = "RBD";
>> +static const char rbd_version[] = "001.001";
>> +
>> +struct rbd_obj_snap_ondisk {
>> +       __le64 id;
>> +       __le64 image_size;
>> +} __attribute__((packed));
>> +
>> +struct rbd_obj_header_ondisk {
>> +       char text[64];
>> +       char signature[4];
>> +       char version[8];
>> +       __le64 image_size;
>
> Unaligned? Is the disk format fixed?

This is a packed structure that represents the on disk format.
Operations on it are being done only to read from the disk header or
to write to the disk header.


Yehuda
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] cgroups: Add an API to attach a task to current task's cgroup

2010-05-20 Thread Paul Menage
On Thu, May 20, 2010 at 3:22 PM, Paul Menage  wrote:
> On Tue, May 18, 2010 at 5:04 PM, Sridhar Samudrala
>  wrote:
>> Add a new kernel API to attach a task to current task's cgroup
>> in all the active hierarchies.
>>
>> Signed-off-by: Sridhar Samudrala 
>
> Reviewed-by: Paul Menage 
>

One other thought on this - this would be the first piece of code
that's attaching a task to a cgroup without holding the cgroup
directory inode i_mutex. I believe that this is probably OK.

Paul
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] cgroups: Add an API to attach a task to current task's cgroup

2010-05-20 Thread Paul Menage
On Tue, May 18, 2010 at 5:04 PM, Sridhar Samudrala
 wrote:
> Add a new kernel API to attach a task to current task's cgroup
> in all the active hierarchies.
>
> Signed-off-by: Sridhar Samudrala 

Reviewed-by: Paul Menage 

It would be more efficient to just attach directly to current->cgroups
rather than potentially creating/destroying one css_set for each
hierarchy until we've completely converged on current->cgroups - but
that would require a bunch of refactoring of the guts of
cgroup_attach_task() to ensure that the right can_attach()/attach()
callbacks are made. That doesn't really seem worthwhile right now for
the initial use, that I imagine isn't going to be
performance-sensitive.

Paul
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Christian Brunner
2010/5/20 Anthony Liguori :
>> With new approaches like Sheepdog or Ceph, things are getting a lot
>> cheaper and you can scale your system without disrupting your service.
>> The concepts are quite similar to what Amazon is doing in their EC2
>> environment, but they certainly won't publish it as OpenSource anytime
>> soon.
>>
>> Both projects have advantages and disadvantages. Ceph is a bit more
>> universal as it implements a whole filesystem. Sheepdog is more
>> feature complete in regards of managing images (e.g. snapshots). Both
>> projects require some additional work to become stable, but they are
>> on a good way.
>>
>> I would really like to see both drivers in the qemu tree, as they are
>> the key to a design shift in how storage in the datacenter is being
>> built.
>>
>
> I'd be more interested in enabling people to build these types of storage
> systems without touching qemu.

You could do this by using Yehuda's rbd kernel driver, but I think
that it would be better to avoid this additional layer.

> Both sheepdog and ceph ultimately transmit I/O over a socket to a central
> daemon, right?  So could we not standardize a protocol for this that both
> sheepdog and ceph could implement?

There is no central daemon. The concept is that they talk to many
storage nodes at the same time. Data is distributed and replicated
over many nodes in the network. The mechanism to do this is quite
complex. I don't know about sheepdog, but in Ceph this is called RADOS
(reliable autonomic distributed object store). Sheepdog and Ceph may
look similar, but this is where they act different. I don't think that
it would be possible to implement a common protocol.

Regards,
Christian
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] device-assignment: Don't use libpci

2010-05-20 Thread Chris Wright
* Alex Williamson (alex.william...@redhat.com) wrote:
> We've already got an open fd for PCI config space for the device, we
> might as well use it.  This also makes sure that if we're making use of
> a privileged file descriptor opened for us, we use it for all accesses
> to the device.
> 
> Signed-off-by: Alex Williamson 

Looks good to me.

Acked-by: Chris Wright 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] device-assignment: Don't use libpci

2010-05-20 Thread Alex Williamson
We've already got an open fd for PCI config space for the device, we
might as well use it.  This also makes sure that if we're making use of
a privileged file descriptor opened for us, we use it for all accesses
to the device.

Signed-off-by: Alex Williamson 
---

 configure  |   21 ---
 hw/device-assignment.c |   66 
 2 files changed, 50 insertions(+), 37 deletions(-)

diff --git a/configure b/configure
index ed8e17b..632d4b0 100755
--- a/configure
+++ b/configure
@@ -1623,27 +1623,6 @@ EOF
 fi
 
 ##
-# libpci probe for kvm_cap_device_assignment
-if test $kvm_cap_device_assignment = "yes" ; then
-  cat > $TMPC << EOF
-#include 
-#ifndef PCI_VENDOR_ID
-#error NO LIBPCI
-#endif
-int main(void) { struct pci_access a; pci_init(&a); return 0; }
-EOF
-  if compile_prog "" "-lpci -lz" ; then
-libs_softmmu="-lpci -lz $libs_softmmu"
-  else
-echo
-echo "Error: libpci check failed"
-echo "Disable KVM Device Assignment capability."
-echo
-kvm_cap_device_assignment=no
-  fi
-fi
-
-##
 # test for vhost net
 
 if test "$vhost_net" != "no"; then
diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 172f0c9..e6b34ac 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -335,24 +335,61 @@ static void assigned_dev_ioport_map(PCIDevice *pci_dev, 
int region_num,
   (r_dev->v_addrs + region_num));
 }
 
-static uint8_t pci_find_cap_offset(struct pci_dev *pci_dev, uint8_t cap)
+static uint32_t assigned_dev_pci_read(PCIDevice *d, int pos, int len)
+{
+AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
+uint32_t val;
+ssize_t ret;
+int fd = pci_dev->real_device.config_fd;
+
+again:
+ret = pread(fd, &val, len, pos);
+if (ret != len) {
+   if ((ret < 0) && (errno == EINTR || errno == EAGAIN))
+   goto again;
+
+   fprintf(stderr, "%s: pread failed, ret = %zd errno = %d\n",
+   __func__, ret, errno);
+
+   exit(1);
+}
+
+return val;
+}
+
+static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos)
+{
+return (uint8_t)assigned_dev_pci_read(d, pos, 1);
+}
+
+static uint16_t assigned_dev_pci_read_word(PCIDevice *d, int pos)
+{
+return (uint16_t)assigned_dev_pci_read(d, pos, 2);
+}
+
+static uint32_t assigned_dev_pci_read_long(PCIDevice *d, int pos)
+{
+return assigned_dev_pci_read(d, pos, 4);
+}
+
+static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap)
 {
 int id;
 int max_cap = 48;
 int pos = PCI_CAPABILITY_LIST;
 int status;
 
-status = pci_read_byte(pci_dev, PCI_STATUS);
+status = assigned_dev_pci_read_byte(d, PCI_STATUS);
 if ((status & PCI_STATUS_CAP_LIST) == 0)
 return 0;
 
 while (max_cap--) {
-pos = pci_read_byte(pci_dev, pos);
+pos = assigned_dev_pci_read_byte(d, pos);
 if (pos < 0x40)
 break;
 
 pos &= ~3;
-id = pci_read_byte(pci_dev, pos + PCI_CAP_LIST_ID);
+id = assigned_dev_pci_read_byte(d, pos + PCI_CAP_LIST_ID);
 
 if (id == 0xff)
 break;
@@ -861,7 +898,7 @@ static int assign_irq(AssignedDevice *dev)
 int irq, r = 0;
 
 /* Interrupt PIN 0 means don't use INTx */
-if (pci_read_byte(dev->pdev, PCI_INTERRUPT_PIN) == 0)
+if (assigned_dev_pci_read_byte(&dev->dev, PCI_INTERRUPT_PIN) == 0)
 return 0;
 
 irq = pci_map_irq(&dev->dev, dev->intpin);
@@ -1199,7 +1236,7 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
 #ifdef KVM_CAP_DEVICE_MSI
 /* Expose MSI capability
  * MSI capability is the 1st capability in capability config */
-if (pci_find_cap_offset(dev->pdev, PCI_CAP_ID_MSI)) {
+if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI)) {
 dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI;
 memset(&pci_dev->config[pci_dev->cap.start + pci_dev->cap.length],
0, PCI_CAPABILITY_CONFIG_MSI_LENGTH);
@@ -1211,23 +1248,25 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
 #endif
 #ifdef KVM_CAP_DEVICE_MSIX
 /* Expose MSI-X capability */
-if (pci_find_cap_offset(dev->pdev, PCI_CAP_ID_MSIX)) {
+if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX)) {
 int pos, entry_nr, bar_nr;
 u32 msix_table_entry;
 dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX;
 memset(&pci_dev->config[pci_dev->cap.start + pci_dev->cap.length],
0, PCI_CAPABILITY_CONFIG_MSIX_LENGTH);
-pos = pci_find_cap_offset(dev->pdev, PCI_CAP_ID_MSIX);
-entry_nr = pci_read_word(dev->pdev, pos + 2) & PCI_MSIX_TABSIZE;
+pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX);
+entry_nr = assigned_dev_pci_read_word(pci_dev, pos + 2) &
+ PCI_MSIX_TABSIZE;
 pci_dev->config[pci_dev->cap.start + pci_dev->

Re: [Qemu-devel] [RFC] Bug Day - June 1st, 2010

2010-05-20 Thread Michael Tokarev

20.05.2010 11:15, Andre Przywara wrote:

Michael Tokarev wrote:

[]

It'd be nice if we had more flexibility in defining custom machine types
so you could just do qemu -M win98.


This is wrong IMHO. win98 and winNT can run on various different
machines, including all modern ones (yes I tried the same winNT
on my Athlon X2-64, just had to switch SATA from AHCI to IDE;
win95 works too)... just not in kvm :)



Well, not really. You were lucky with your Athlon X2-64, actually it is
the last machine not triggering the bug. I tried it on a AthlonII-X4
(which has maxleaf=5 as any newer AMD machines) and it showed the same
bug. On Intel boxes this bug should trigger on every CPU starting with
some Pentium4 models, including all Core chips.
Have you tried versions with a newer service pack (SP6)?


I replied in the original discussion -- after upgrading to SP6
there's no need in ,level=1 anymore, any -cpu variant works
without crashes.  The problem is to set it up, at least for
me, since I don't have sp6 integrated into setup.  Well, I
don't use winNT to start with, actually, so for me it's not
a problem at all ;) -- the reason why I asked is because I
have a debian bugreport about this very issue, see
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=575439
(and because I had winNT install handy)

But this is really interesting information - that winNT
fails on other CPUs too.  Thank you for that, now I can
close the debian bugreport ;)


BTW: Does anyone knows what the problem with Windows95/98 on KVM is? I
tried some tracing today, but couldn't find a hint.


Um. The bugreport(s) come as a surprize for me: I tried to
install win98 in kvm several times in the past but setup
always failed - different messages in different versions
of kvm, either "unable to emulate" or "real mode trap" or
something else, or just lockup, usually on first reboot.
So - the bugreports talks about mouse non-working, but
this means win98 itself works somehow... I dunno :)



I think these bug reports are about plain QEMU. I tried it yesterday, in
fact the mouse is non-functional. In KVM Windows95 gives me a black
screen after the welcome screen with the moving bottom row. There are
just two lines at the top: (translated from the german version)
While initializing device NTKERN:
Windows protection fault. Restart the computer.


Yeah, that's what i've seen too, it's exactly
ow it fails here with modern kvm.


KVM catched some #UDs due to ARPL from VM86 mode, but TCG got them too
and it survived. So if anyone has some more hints, I'd be grateful.



Thank you!

/mjt
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Anthony Liguori

On 05/20/2010 04:18 PM, Christian Brunner wrote:

Thanks for your comments. I'll send an updated patch in a few days.

Having a central storage system is quite essential in larger hosting
environments, it enables you to move your guest systems from one node
to another easily (live-migration or dynamic restart). Traditionally
this has been done using SAN, iSCSI or NFS. However most of these
systems don't scale very well and and the costs for high-availability
are quite high.

With new approaches like Sheepdog or Ceph, things are getting a lot
cheaper and you can scale your system without disrupting your service.
The concepts are quite similar to what Amazon is doing in their EC2
environment, but they certainly won't publish it as OpenSource anytime
soon.

Both projects have advantages and disadvantages. Ceph is a bit more
universal as it implements a whole filesystem. Sheepdog is more
feature complete in regards of managing images (e.g. snapshots). Both
projects require some additional work to become stable, but they are
on a good way.

I would really like to see both drivers in the qemu tree, as they are
the key to a design shift in how storage in the datacenter is being
built.
   


I'd be more interested in enabling people to build these types of 
storage systems without touching qemu.


Both sheepdog and ceph ultimately transmit I/O over a socket to a 
central daemon, right?  So could we not standardize a protocol for this 
that both sheepdog and ceph could implement?


Regards,

Anthony Liguori


Christian
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
   


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Christian Brunner
2010/5/20 Blue Swirl :
> On Wed, May 19, 2010 at 7:22 PM, Christian Brunner  wrote:
>> The attached patch is a block driver for the distributed file system
>> Ceph (http://ceph.newdream.net/). This driver uses librados (which
>> is part of the Ceph server) for direct access to the Ceph object
>> store and is running entirely in userspace. Therefore it is
>> called "rbd" - rados block device.
>>
>> To compile the driver a recent version of ceph (>= 0.20.1) is needed
>> and you have to "--enable-rbd" when running configure.
>>
>> Additional information is available on the Ceph-Wiki:
>>
>> http://ceph.newdream.net/wiki/Kvm-rbd
>
>
> I have no idea whether it makes sense to add Ceph (no objection
> either). I have some minor comments below.

Thanks for your comments. I'll send an updated patch in a few days.

Having a central storage system is quite essential in larger hosting
environments, it enables you to move your guest systems from one node
to another easily (live-migration or dynamic restart). Traditionally
this has been done using SAN, iSCSI or NFS. However most of these
systems don't scale very well and and the costs for high-availability
are quite high.

With new approaches like Sheepdog or Ceph, things are getting a lot
cheaper and you can scale your system without disrupting your service.
The concepts are quite similar to what Amazon is doing in their EC2
environment, but they certainly won't publish it as OpenSource anytime
soon.

Both projects have advantages and disadvantages. Ceph is a bit more
universal as it implements a whole filesystem. Sheepdog is more
feature complete in regards of managing images (e.g. snapshots). Both
projects require some additional work to become stable, but they are
on a good way.

I would really like to see both drivers in the qemu tree, as they are
the key to a design shift in how storage in the datacenter is being
built.

Christian
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pc: fix segfault introduced by 3d53f5c36ff6

2010-05-20 Thread Blue Swirl
Good catch. Thanks, applied.


On Thu, May 20, 2010 at 6:14 AM, Eduard - Gabriel Munteanu
 wrote:
> Commit 3d53f5c36ff6 introduced a segfault by erroneously making fw_cfg a
> 'void **' and passing it around in different ways.
>
> Signed-off-by: Eduard - Gabriel Munteanu 
> ---
>  hw/pc.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/pc.c b/hw/pc.c
> index fee08c9..4a4a706 100644
> --- a/hw/pc.c
> +++ b/hw/pc.c
> @@ -822,7 +822,7 @@ void pc_memory_init(ram_addr_t ram_size,
>     ram_addr_t ram_addr, bios_offset, option_rom_offset;
>     ram_addr_t below_4g_mem_size, above_4g_mem_size = 0;
>     int bios_size, isa_bios_size;
> -    void **fw_cfg;
> +    void *fw_cfg;
>
>     if (ram_size >= 0xe000 ) {
>         above_4g_mem_size = ram_size - 0xe000;
> @@ -905,7 +905,7 @@ void pc_memory_init(ram_addr_t ram_size,
>     rom_set_fw(fw_cfg);
>
>     if (linux_boot) {
> -        load_linux(*fw_cfg, kernel_filename, initrd_filename, 
> kernel_cmdline, below_4g_mem_size);
> +        load_linux(fw_cfg, kernel_filename, initrd_filename, kernel_cmdline, 
> below_4g_mem_size);
>     }
>
>     for (i = 0; i < nb_option_roms; i++) {
> --
> 1.6.4.4
>
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Blue Swirl
On Wed, May 19, 2010 at 7:22 PM, Christian Brunner  wrote:
> The attached patch is a block driver for the distributed file system
> Ceph (http://ceph.newdream.net/). This driver uses librados (which
> is part of the Ceph server) for direct access to the Ceph object
> store and is running entirely in userspace. Therefore it is
> called "rbd" - rados block device.
>
> To compile the driver a recent version of ceph (>= 0.20.1) is needed
> and you have to "--enable-rbd" when running configure.
>
> Additional information is available on the Ceph-Wiki:
>
> http://ceph.newdream.net/wiki/Kvm-rbd


I have no idea whether it makes sense to add Ceph (no objection
either). I have some minor comments below.

>
> ---
>  Makefile          |    3 +
>  Makefile.objs     |    1 +
>  block/rados.h     |  376 ++
>  block/rbd.c       |  585 
> +
>  block/rbd_types.h |   48 +
>  configure         |   27 +++
>  6 files changed, 1040 insertions(+), 0 deletions(-)
>  create mode 100644 block/rados.h
>  create mode 100644 block/rbd.c
>  create mode 100644 block/rbd_types.h
>
> diff --git a/Makefile b/Makefile
> index eb9e02b..b1ab3e9 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -27,6 +27,9 @@ configure: ;
>  $(call set-vpath, $(SRC_PATH):$(SRC_PATH)/hw)
>
>  LIBS+=-lz $(LIBS_TOOLS)
> +ifdef CONFIG_RBD
> +LIBS+=-lrados
> +endif
>
>  ifdef BUILD_DOCS
>  DOCS=qemu-doc.html qemu-tech.html qemu.1 qemu-img.1 qemu-nbd.8
> diff --git a/Makefile.objs b/Makefile.objs
> index acbaf22..85791ac 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -18,6 +18,7 @@ block-nested-y += parallels.o nbd.o blkdebug.o
>  block-nested-$(CONFIG_WIN32) += raw-win32.o
>  block-nested-$(CONFIG_POSIX) += raw-posix.o
>  block-nested-$(CONFIG_CURL) += curl.o
> +block-nested-$(CONFIG_RBD) += rbd.o
>
>  block-obj-y +=  $(addprefix block/, $(block-nested-y))
>
> diff --git a/block/rados.h b/block/rados.h
> new file mode 100644
> index 000..6cde9a1
> --- /dev/null
> +++ b/block/rados.h
> @@ -0,0 +1,376 @@
> +#ifndef __RADOS_H
> +#define __RADOS_H

IIRC underscores here may conflict with system header use. Please use
something like QEMU_BLOCK_RADOS_H.

> +
> +/*
> + * Data types for the Ceph distributed object storage layer RADOS
> + * (Reliable Autonomic Distributed Object Store).
> + */
> +
> +
> +
> +/*
> + * osdmap encoding versions
> + */
> +#define CEPH_OSDMAP_INC_VERSION     5
> +#define CEPH_OSDMAP_INC_VERSION_EXT 5
> +#define CEPH_OSDMAP_VERSION         5
> +#define CEPH_OSDMAP_VERSION_EXT     5
> +
> +/*
> + * fs id
> + */
> +struct ceph_fsid {
> +       unsigned char fsid[16];

Too large indent, please check also elsewhere.

> +};
> +
> +static inline int ceph_fsid_compare(const struct ceph_fsid *a,
> +                                   const struct ceph_fsid *b)
> +{
> +       return memcmp(a, b, sizeof(*a));
> +}
> +
> +/*
> + * ino, object, etc.
> + */
> +typedef __le64 ceph_snapid_t;

Please use uint64_t and le_to_cpu()/cpu_to_le().

> +#define CEPH_SNAPDIR ((__u64)(-1))  /* reserved for hidden .snap dir */

Likewise, uint64_t is the standard type. Also other places.

> +#define CEPH_NOSNAP  ((__u64)(-2))  /* "head", "live" revision */
> +#define CEPH_MAXSNAP ((__u64)(-3))  /* largest valid snapid */
> +
> +struct ceph_timespec {
> +       __le32 tv_sec;
> +       __le32 tv_nsec;
> +} __attribute__ ((packed));
> +
> +
> +/*
> + * object layout - how objects are mapped into PGs
> + */
> +#define CEPH_OBJECT_LAYOUT_HASH     1
> +#define CEPH_OBJECT_LAYOUT_LINEAR   2
> +#define CEPH_OBJECT_LAYOUT_HASHINO  3
> +
> +/*
> + * pg layout -- how PGs are mapped onto (sets of) OSDs
> + */
> +#define CEPH_PG_LAYOUT_CRUSH  0
> +#define CEPH_PG_LAYOUT_HASH   1
> +#define CEPH_PG_LAYOUT_LINEAR 2
> +#define CEPH_PG_LAYOUT_HYBRID 3
> +
> +
> +/*
> + * placement group.
> + * we encode this into one __le64.
> + */
> +struct ceph_pg {
> +       __le16 preferred; /* preferred primary osd */
> +       __le16 ps;        /* placement seed */
> +       __le32 pool;      /* object pool */
> +} __attribute__ ((packed));
> +
> +/*
> + * pg_pool is a set of pgs storing a pool of objects
> + *
> + *  pg_num -- base number of pseudorandomly placed pgs
> + *
> + *  pgp_num -- effective number when calculating pg placement.  this
> + * is used for pg_num increases.  new pgs result in data being "split"
> + * into new pgs.  for this to proceed smoothly, new pgs are intiially
> + * colocated with their parents; that is, pgp_num doesn't increase
> + * until the new pgs have successfully split.  only _then_ are the new
> + * pgs placed independently.
> + *
> + *  lpg_num -- localized pg count (per device).  replicas are randomly
> + * selected.
> + *
> + *  lpgp_num -- as above.
> + */
> +#define CEPH_PG_TYPE_REP     1
> +#define CEPH_PG_TYPE_RAID4   2
> +#define CEPH_PG_POOL_VERSION 2
> +struct ceph_pg_pool {
> +       __u8 type;                /* CEPH_PG_TYPE_* */
> +       __u8 size;         

Support for direct inter-VM sockets? Inter-VM shared memory?

2010-05-20 Thread Tyler Bletsch
I'm interested in moving some research prototypes from Xen to KVM, but 
there are a few esoteric features I'd need to do this.


First is an efficient mechanism for direct VM-to-VM sockets...something 
that bypasses the protocol stack and minimizes overhead.  Xen has 
XenSocket, XenLoop, and others.  I found a few mentions of this idea 
dating back to 2006*, mostly saying that a few people have done 
something like this as a one-off, but nothing official has been 
released.  I haven't found anything like this more recently, though.  
Has there been any progress on this front?  Ideally, I'd want a 
character device or a special program that acts as a fast pipe to a 
different VM.


Second, what about inter-VM shared memory?

Apologies if I'm missing some well known doc...I search around google & 
the wiki to no avail.


Thanks,
Tyler Bletsch

* http://www.mail-archive.com/kvm-de...@lists.sourceforge.net/msg00304.html


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: computer frozen

2010-05-20 Thread Brian Jackson
On Thursday, May 20, 2010 02:03:31 am magicboiz wrote:
> Hello
> 
> since kernel 2.6.28 or 2.6.29, I don't remember exactly, whenever I try to
> run KVM in my laptop, I get my computer totally frozen.
> 
> I'd try:
>  - "-no-kvm" flag: works, but very slow
>  - "-cpu qemu32,-nx": frozen
>  - "-no-acpi" flag: frozen
> 
> I'd try with several kernels (ubuntu and openssuse kernels), also with
> custom kernels compiled by me (with the minimal options enabled)but
> always the same result: computer frozen


It's been a long time since KVM caused host lockups. That is almost always 
something hardware/bios/local config related.


> 
> An interesting point: with Sun VirtualBox 3.1, the same frozen result.
> 
> My laptop is a TOSHIBA TECRA S4 (europe model only).
> 
> magicb...@linux-ue9l:~/> cat /proc/cpuinfo



> 
> Anyone can help me?
> 
> Thx in advance.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][VHOST] fix race with guest on multi-buffer used buffer updates

2010-05-20 Thread David L Stevens
[for Michael Tsirkin's vhost development git tree]

This patch fixes a race between guest and host when
adding used buffers wraps the ring. Without it, guests
can see partial packets before num_buffers is set in
the vnet header.

Signed-off-by: David L Stevens 

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7f2568d..74790ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1065,14 +1065,6 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
vq_err(vq, "Failed to write used");
return -EFAULT;
}
-   /* Make sure buffer is written before we update index. */
-   smp_wmb();
-   if (put_user(vq->last_used_idx + count, &vq->used->idx)) {
-   vq_err(vq, "Failed to increment used idx");
-   return -EFAULT;
-   }
-   if (unlikely(vq->log_used))
-   vhost_log_used(vq, used);
vq->last_used_idx += count;
return 0;
 }
@@ -1093,7 +1085,17 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct 
vring_used_elem *heads,
heads += n;
count -= n;
}
-   return __vhost_add_used_n(vq, heads, count);
+   r = __vhost_add_used_n(vq, heads, count);
+
+   /* Make sure buffer is written before we update index. */
+   smp_wmb();
+   if (put_user(vq->last_used_idx, &vq->used->idx)) {
+   vq_err(vq, "Failed to increment used idx");
+   return -EFAULT;
+   }
+   if (unlikely(vq->log_used))
+   vhost_log_used(vq, vq->used->ring + start);
+   return r;
 }
 
 /* This actually signals the guest, using eventfd. */


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Print a user-friendly message on failed vmentry

2010-05-20 Thread Mohammed Gamal
On Thu, May 20, 2010 at 6:53 PM, Avi Kivity  wrote:
> On 05/20/2010 05:46 PM, Mohammed Gamal wrote:
>>
>> On Thu, May 20, 2010 at 5:37 PM, Chris Lalancette
>>  wrote:
>>
>>>
>>> On 05/19/2010 05:16 PM, Mohammed Gamal wrote:
>>>

 This patch address bug report in
 https://bugs.launchpad.net/qemu/+bug/530077.

 Failed vmentries were handled with handle_unhandled() which prints a
 rather
 unfriendly message to the user. This patch separates handling vmentry
 failures
 from unknown exit reasons and prints a friendly message to the user.

 Signed-off-by: Mohammed Gamal
 ---
  qemu-kvm.c |   16 +++-
  1 files changed, 15 insertions(+), 1 deletions(-)

 diff --git a/qemu-kvm.c b/qemu-kvm.c
 index 35a4c8a..deb4df8 100644
 --- a/qemu-kvm.c
 +++ b/qemu-kvm.c
 @@ -106,6 +106,20 @@ static int handle_unhandled(uint64_t reason)
      return -EINVAL;
  }

 +static int handle_failed_vmentry(uint64_t reason)
 +{
 +    fprintf(stderr, "kvm: vm entry failed with error 0x%" PRIx64
 "\n\n", reason);
 +    fprintf(stderr, "If you're runnning a guest on an Intel machine, it
 can be\n");
 +    fprintf(stderr, "most-likely due to the guest going into an invalid
 state\n");
 +    fprintf(stderr, "for Intel VT. For example, the guest maybe running
 in big\n");
 +    fprintf(stderr, "real mode which is not supported by Intel
 VT.\n\n");
 +    fprintf(stderr, "You may want to try enabling real mode emulation
 in KVM.\n");
 +    fprintf(stderr, "To Enable it, you may run the following commands
 as root:\n");
 +    fprintf(stderr, "# rmmod kvm_intel\n");
 +    fprintf(stderr, "# rmmod kvm\n");
 +    fprintf(stderr, "# modprobe kvm_intel
 emulate_invalid_guest_state=1\n");
 +    return -EINVAL;
 +}

>>>
>>> The thing is, there are other valid reasons for vmentry failure.  A while
>>> ago I tracked
>>> down a bug in the Linux kernel that was causing us to vmenter with
>>> invalid segments;
>>> this message would have been very misleading in that case.  I think you'd
>>> have to do
>>> more complete analysis of the vmentry failure code to be more certain
>>> about the reason
>>> for failure.
>>>
>>>
>>
>> Your point is definitely valid, yet big real mode is usually the most
>> likely case, and that's why this message is shown. Note also that it
>> says it's _most likely_ a failure caused by an invalid guest state,
>> but it doesn't rule out all other reasons. And in any case, it'd be
>> better than just printing something along the lines of:
>> " kvm: unhandled exit 8021
>>   kvm_run returned -22"
>>
>
> However, we're still giving bad advice.  Currently
> emulate_invalid_guest_state=1 is not going to work well (right?).  Once it
> does, we'll simply make it the default and the message will never appear.
>
I already added a warning in the second patch I sent.

> --
> Do not meddle in the internals of kernels, for they are subtle and quick to
> panic.
>
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Print a user-friendly message on failed vmentry

2010-05-20 Thread Avi Kivity

On 05/20/2010 05:46 PM, Mohammed Gamal wrote:

On Thu, May 20, 2010 at 5:37 PM, Chris Lalancette  wrote:
   

On 05/19/2010 05:16 PM, Mohammed Gamal wrote:
 

This patch address bug report in https://bugs.launchpad.net/qemu/+bug/530077.

Failed vmentries were handled with handle_unhandled() which prints a rather
unfriendly message to the user. This patch separates handling vmentry failures
from unknown exit reasons and prints a friendly message to the user.

Signed-off-by: Mohammed Gamal
---
  qemu-kvm.c |   16 +++-
  1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/qemu-kvm.c b/qemu-kvm.c
index 35a4c8a..deb4df8 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -106,6 +106,20 @@ static int handle_unhandled(uint64_t reason)
  return -EINVAL;
  }

+static int handle_failed_vmentry(uint64_t reason)
+{
+fprintf(stderr, "kvm: vm entry failed with error 0x%" PRIx64 "\n\n", 
reason);
+fprintf(stderr, "If you're runnning a guest on an Intel machine, it can 
be\n");
+fprintf(stderr, "most-likely due to the guest going into an invalid 
state\n");
+fprintf(stderr, "for Intel VT. For example, the guest maybe running in 
big\n");
+fprintf(stderr, "real mode which is not supported by Intel VT.\n\n");
+fprintf(stderr, "You may want to try enabling real mode emulation in 
KVM.\n");
+fprintf(stderr, "To Enable it, you may run the following commands as 
root:\n");
+fprintf(stderr, "# rmmod kvm_intel\n");
+fprintf(stderr, "# rmmod kvm\n");
+fprintf(stderr, "# modprobe kvm_intel emulate_invalid_guest_state=1\n");
+return -EINVAL;
+}
   

The thing is, there are other valid reasons for vmentry failure.  A while ago I 
tracked
down a bug in the Linux kernel that was causing us to vmenter with invalid 
segments;
this message would have been very misleading in that case.  I think you'd have 
to do
more complete analysis of the vmentry failure code to be more certain about the 
reason
for failure.

 

Your point is definitely valid, yet big real mode is usually the most
likely case, and that's why this message is shown. Note also that it
says it's _most likely_ a failure caused by an invalid guest state,
but it doesn't rule out all other reasons. And in any case, it'd be
better than just printing something along the lines of:
" kvm: unhandled exit 8021
   kvm_run returned -22"
   


However, we're still giving bad advice.  Currently 
emulate_invalid_guest_state=1 is not going to work well (right?).  Once 
it does, we'll simply make it the default and the message will never appear.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Avi Kivity

On 05/20/2010 05:34 PM, Rusty Russell wrote:



Have just one ring, no indexes.  The producer places descriptors into
the ring and updates the head,  The consumer copies out descriptors to
be processed and copies back in completed descriptors.  Chaining is
always linear.  The descriptors contain a tag that allow the producer to
identify the completion.
 

This could definitely work.  The original reason for the page boundaries
was for untrusted inter-guest communication: with appropriate page protections
they could see each other's rings and a simply inter-guest copy hypercall
could verify that the other guest really exposed that data via virtio ring.

But, cute as that is, we never did that.  And it's not clear that it wins
much over simply having the hypervisor read both rings directly.
   


AFAICS having separate avail_ring/used_ring/desc_pool is orthogonal to 
this cuteness.



Can we do better?  The obvious idea is to try to get rid of last_used and
used, and use the ring itself.  We would use an invalid entry to mark the
head of the ring.
   

Interesting!  So a peer will read until it hits a wall.  But how to
update the wall atomically?

Maybe we can have a flag in the descriptor indicate headness or
tailness.  Update looks ugly though: write descriptor with head flag,
write next descriptor with head flag, remove flag from previous descriptor.
 

I was thinking a separate magic "invalid" entry.  To publish an 3 descriptor
chain, you would write descriptors 2 and 3, write an invalid entry at 4,
barrier, write entry 1.  It is a bit ugly, yes, but not terrible.
   


Worth exploring. This amortizes the indexes into the ring, a good thing.

Another thing we can do is place the tail a half ring away from the head 
(and limit ring utilization to 50%), reducing bounces on short kicks.  
Or equivalently have an avail ring and used ring, but both containing 
tagged descriptors instead of pointers to descriptors.



I think that a simple simulator for this is worth writing, which tracks
cacheline moves under various fullness scenarios...
   


Yup.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Print a user-friendly message on failed vmentry

2010-05-20 Thread Mohammed Gamal
On Thu, May 20, 2010 at 5:37 PM, Chris Lalancette  wrote:
> On 05/19/2010 05:16 PM, Mohammed Gamal wrote:
>> This patch address bug report in https://bugs.launchpad.net/qemu/+bug/530077.
>>
>> Failed vmentries were handled with handle_unhandled() which prints a rather
>> unfriendly message to the user. This patch separates handling vmentry 
>> failures
>> from unknown exit reasons and prints a friendly message to the user.
>>
>> Signed-off-by: Mohammed Gamal 
>> ---
>>  qemu-kvm.c |   16 +++-
>>  1 files changed, 15 insertions(+), 1 deletions(-)
>>
>> diff --git a/qemu-kvm.c b/qemu-kvm.c
>> index 35a4c8a..deb4df8 100644
>> --- a/qemu-kvm.c
>> +++ b/qemu-kvm.c
>> @@ -106,6 +106,20 @@ static int handle_unhandled(uint64_t reason)
>>      return -EINVAL;
>>  }
>>
>> +static int handle_failed_vmentry(uint64_t reason)
>> +{
>> +    fprintf(stderr, "kvm: vm entry failed with error 0x%" PRIx64 "\n\n", 
>> reason);
>> +    fprintf(stderr, "If you're runnning a guest on an Intel machine, it can 
>> be\n");
>> +    fprintf(stderr, "most-likely due to the guest going into an invalid 
>> state\n");
>> +    fprintf(stderr, "for Intel VT. For example, the guest maybe running in 
>> big\n");
>> +    fprintf(stderr, "real mode which is not supported by Intel VT.\n\n");
>> +    fprintf(stderr, "You may want to try enabling real mode emulation in 
>> KVM.\n");
>> +    fprintf(stderr, "To Enable it, you may run the following commands as 
>> root:\n");
>> +    fprintf(stderr, "# rmmod kvm_intel\n");
>> +    fprintf(stderr, "# rmmod kvm\n");
>> +    fprintf(stderr, "# modprobe kvm_intel emulate_invalid_guest_state=1\n");
>> +    return -EINVAL;
>> +}
>
> The thing is, there are other valid reasons for vmentry failure.  A while ago 
> I tracked
> down a bug in the Linux kernel that was causing us to vmenter with invalid 
> segments;
> this message would have been very misleading in that case.  I think you'd 
> have to do
> more complete analysis of the vmentry failure code to be more certain about 
> the reason
> for failure.
>
Your point is definitely valid, yet big real mode is usually the most
likely case, and that's why this message is shown. Note also that it
says it's _most likely_ a failure caused by an invalid guest state,
but it doesn't rule out all other reasons. And in any case, it'd be
better than just printing something along the lines of:
" kvm: unhandled exit 8021
  kvm_run returned -22"

> --
> Chris Lalancette
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Print a user-friendly message on failed vmentry

2010-05-20 Thread Anthony Liguori

On 05/20/2010 09:37 AM, Chris Lalancette wrote:

On 05/19/2010 05:16 PM, Mohammed Gamal wrote:
   

This patch address bug report in https://bugs.launchpad.net/qemu/+bug/530077.

Failed vmentries were handled with handle_unhandled() which prints a rather
unfriendly message to the user. This patch separates handling vmentry failures
from unknown exit reasons and prints a friendly message to the user.

Signed-off-by: Mohammed Gamal
---
  qemu-kvm.c |   16 +++-
  1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/qemu-kvm.c b/qemu-kvm.c
index 35a4c8a..deb4df8 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -106,6 +106,20 @@ static int handle_unhandled(uint64_t reason)
  return -EINVAL;
  }

+static int handle_failed_vmentry(uint64_t reason)
+{
+fprintf(stderr, "kvm: vm entry failed with error 0x%" PRIx64 "\n\n", 
reason);
+fprintf(stderr, "If you're runnning a guest on an Intel machine, it can 
be\n");
+fprintf(stderr, "most-likely due to the guest going into an invalid 
state\n");
+fprintf(stderr, "for Intel VT. For example, the guest maybe running in 
big\n");
+fprintf(stderr, "real mode which is not supported by Intel VT.\n\n");
+fprintf(stderr, "You may want to try enabling real mode emulation in 
KVM.\n");
+fprintf(stderr, "To Enable it, you may run the following commands as 
root:\n");
+fprintf(stderr, "# rmmod kvm_intel\n");
+fprintf(stderr, "# rmmod kvm\n");
+fprintf(stderr, "# modprobe kvm_intel emulate_invalid_guest_state=1\n");
+return -EINVAL;
+}
 

The thing is, there are other valid reasons for vmentry failure.  A while ago I 
tracked
down a bug in the Linux kernel that was causing us to vmenter with invalid 
segments;
this message would have been very misleading in that case.  I think you'd have 
to do
more complete analysis of the vmentry failure code to be more certain about the 
reason
for failure.
   


We should probably only display this message if reason == 0x8021.  
It may be worth looking at the cpu_state to verify that we're trying to 
enter 16-bit mode too.


Then the message can be much more definitive too.

Regards,

Anthony Liguori


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Print a user-friendly message on failed vmentry

2010-05-20 Thread Chris Lalancette
On 05/19/2010 05:16 PM, Mohammed Gamal wrote:
> This patch address bug report in https://bugs.launchpad.net/qemu/+bug/530077.
> 
> Failed vmentries were handled with handle_unhandled() which prints a rather
> unfriendly message to the user. This patch separates handling vmentry failures
> from unknown exit reasons and prints a friendly message to the user.
> 
> Signed-off-by: Mohammed Gamal 
> ---
>  qemu-kvm.c |   16 +++-
>  1 files changed, 15 insertions(+), 1 deletions(-)
> 
> diff --git a/qemu-kvm.c b/qemu-kvm.c
> index 35a4c8a..deb4df8 100644
> --- a/qemu-kvm.c
> +++ b/qemu-kvm.c
> @@ -106,6 +106,20 @@ static int handle_unhandled(uint64_t reason)
>  return -EINVAL;
>  }
>  
> +static int handle_failed_vmentry(uint64_t reason)
> +{
> +fprintf(stderr, "kvm: vm entry failed with error 0x%" PRIx64 "\n\n", 
> reason);
> +fprintf(stderr, "If you're runnning a guest on an Intel machine, it can 
> be\n");
> +fprintf(stderr, "most-likely due to the guest going into an invalid 
> state\n");
> +fprintf(stderr, "for Intel VT. For example, the guest maybe running in 
> big\n");
> +fprintf(stderr, "real mode which is not supported by Intel VT.\n\n");
> +fprintf(stderr, "You may want to try enabling real mode emulation in 
> KVM.\n");
> +fprintf(stderr, "To Enable it, you may run the following commands as 
> root:\n");
> +fprintf(stderr, "# rmmod kvm_intel\n");
> +fprintf(stderr, "# rmmod kvm\n");
> +fprintf(stderr, "# modprobe kvm_intel emulate_invalid_guest_state=1\n");
> +return -EINVAL;
> +}

The thing is, there are other valid reasons for vmentry failure.  A while ago I 
tracked
down a bug in the Linux kernel that was causing us to vmenter with invalid 
segments;
this message would have been very misleading in that case.  I think you'd have 
to do
more complete analysis of the vmentry failure code to be more certain about the 
reason
for failure.

-- 
Chris Lalancette
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Rusty Russell
On Thu, 20 May 2010 04:30:56 pm Avi Kivity wrote:
> On 05/20/2010 08:01 AM, Rusty Russell wrote:
> >
> >> A device with out of order
> >> completion (like virtio-blk) will quickly randomize the unused
> >> descriptor indexes, so every descriptor fetch will require a bounce.
> >>
> >> In contrast, if the rings hold the descriptors themselves instead of
> >> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
> >> every descriptor, amortized.
> >>  
> > We already have indirect, this would be a logical next step.  So let's
> > think about it. The avail ring would contain 64 bit values, the used ring
> > would contain indexes into the avail ring.
> 
> Have just one ring, no indexes.  The producer places descriptors into 
> the ring and updates the head,  The consumer copies out descriptors to 
> be processed and copies back in completed descriptors.  Chaining is 
> always linear.  The descriptors contain a tag that allow the producer to 
> identify the completion.

This could definitely work.  The original reason for the page boundaries
was for untrusted inter-guest communication: with appropriate page protections
they could see each other's rings and a simply inter-guest copy hypercall
could verify that the other guest really exposed that data via virtio ring.

But, cute as that is, we never did that.  And it's not clear that it wins
much over simply having the hypervisor read both rings directly.

> > Can we do better?  The obvious idea is to try to get rid of last_used and
> > used, and use the ring itself.  We would use an invalid entry to mark the
> > head of the ring.
> 
> Interesting!  So a peer will read until it hits a wall.  But how to 
> update the wall atomically?
> 
> Maybe we can have a flag in the descriptor indicate headness or 
> tailness.  Update looks ugly though: write descriptor with head flag, 
> write next descriptor with head flag, remove flag from previous descriptor.

I was thinking a separate magic "invalid" entry.  To publish an 3 descriptor
chain, you would write descriptors 2 and 3, write an invalid entry at 4,
barrier, write entry 1.  It is a bit ugly, yes, but not terrible.

I think that a simple simulator for this is worth writing, which tracks
cacheline moves under various fullness scenarios...

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH] AMD IOMMU emulation

2010-05-20 Thread Eduard - Gabriel Munteanu
This is preliminary work for AMD IOMMU emulation support.

Signed-off-by: Eduard - Gabriel Munteanu 
---
 Makefile.target |2 +
 configure   |9 +
 hw/amd_iommu.c  |  442 +++
 hw/pc.c |2 +
 hw/pc.h |3 +
 hw/pci_ids.h|2 +
 hw/pci_regs.h   |1 +
 7 files changed, 461 insertions(+), 0 deletions(-)
 create mode 100644 hw/amd_iommu.c

diff --git a/Makefile.target b/Makefile.target
index 0bdb184..13f8086 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -217,6 +217,8 @@ obj-i386-y += testdev.o
 obj-i386-$(CONFIG_KVM_PIT) += i8254-kvm.o
 obj-i386-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += device-assignment.o
 
+obj-i386-$(CONFIG_AMD_IOMMU) += amd_iommu.o
+
 # Hardware support
 obj-ia64-y += ide.o pckbd.o vga.o $(SOUND_HW) dma.o $(AUDIODRV)
 obj-ia64-y += fdc.o mc146818rtc.o serial.o i8259.o ipf.o
diff --git a/configure b/configure
index ed8e17b..34e5194 100755
--- a/configure
+++ b/configure
@@ -305,6 +305,7 @@ mixemu="no"
 kvm_trace="no"
 kvm_cap_pit=""
 kvm_cap_device_assignment=""
+amd_iommu="no"
 kerneldir=""
 aix="no"
 blobs="yes"
@@ -603,6 +604,8 @@ for opt do
   ;;
   --enable-kvm-device-assignment) kvm_cap_device_assignment="yes"
   ;;
+  --enable-amd-iommu-emul) amd_iommu="yes"
+  ;;
   --enable-profiler) profiler="yes"
   ;;
   --enable-cocoa)
@@ -829,6 +832,8 @@ echo "  --disable-kvm-pitdisable KVM pit support"
 echo "  --enable-kvm-pit enable KVM pit support"
 echo "  --disable-kvm-device-assignment  disable KVM device assignment support"
 echo "  --enable-kvm-device-assignment   enable KVM device assignment support"
+echo "  --disable-amd-iommu-emul disable AMD IOMMU emulation"
+echo "  --enable-amd-iommu-emul  enable AMD IOMMU emulation"
 echo "  --disable-nptl   disable usermode NPTL support"
 echo "  --enable-nptlenable usermode NPTL support"
 echo "  --enable-system  enable all system emulation targets"
@@ -2185,6 +2190,7 @@ echo "KVM support   $kvm"
 echo "KVM PIT support   $kvm_cap_pit"
 echo "KVM device assig. $kvm_cap_device_assignment"
 echo "KVM trace support $kvm_trace"
+echo "AMD IOMMU emul.   $amd_iommu"
 echo "fdt support   $fdt"
 echo "preadv support$preadv"
 echo "fdatasync $fdatasync"
@@ -2599,6 +2605,9 @@ case "$target_arch2" in
   x86_64)
 TARGET_BASE_ARCH=i386
 target_phys_bits=64
+if test "$amd_iommu" = "yes"; then
+  echo "CONFIG_AMD_IOMMU=y" >> $config_target_mak
+fi
   ;;
   ia64)
 target_phys_bits=64
diff --git a/hw/amd_iommu.c b/hw/amd_iommu.c
new file mode 100644
index 000..cde90d0
--- /dev/null
+++ b/hw/amd_iommu.c
@@ -0,0 +1,442 @@
+/*
+ * AMD IOMMU emulation
+ *
+ * Copyright (c) 2010 Eduard - Gabriel Munteanu
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "pc.h"
+#include "hw.h"
+#include "pci.h"
+
+/* Capability registers */
+#define CAPAB_HEADER0x00
+#define   CAPAB_REV_TYPE0x02
+#define   CAPAB_FLAGS   0x03
+#define CAPAB_BAR_LOW   0x04
+#define CAPAB_BAR_HIGH  0x08
+#define CAPAB_RANGE 0x0C
+#define CAPAB_MISC  0x10
+
+#define CAPAB_SIZE  0x14
+
+/* Capability header data */
+#define CAPAB_FLAG_IOTLBSUP (1 << 0)
+#define CAPAB_FLAG_HTTUNNEL (1 << 1)
+#define CAPAB_FLAG_NPCACHE  (1 << 2)
+#define CAPAB_INIT_REV  (1 << 3)
+#define CAPAB_INIT_TYPE 3
+#define CAPAB_INIT_REV_TYPE (CAPAB_REV | CAPAB_TYPE)
+#define CAPAB_INIT_FLAGS(CAPAB_FLAG_NPCACHE | CAPAB_FLAG_HTTUNNEL)
+#define CAPAB_INIT_MISC (64 << 15) | (48 << 8)
+#define CAPAB_BAR_MASK  ~((1UL << 14) - 1)
+
+/* MMIO registers */
+#define MMIO_DEVICE_TABLE   0x
+#define MMIO_COMMAND_BASE   0x0008
+#define MMIO_EVENT_BASE 0x0010
+#define MMIO_CONTROL0x0018
+#define MMIO_EXCL_BASE  0x0020
+#define MMIO_EXCL_LIMIT   

Re: [Qemu-devel] [PATCH 0/2] Fix scsi-generic breakage in upstream qemu-kvm.git

2010-05-20 Thread Kevin Wolf
Am 17.05.2010 18:45, schrieb Nicholas A. Bellinger:
> From: Nicholas Bellinger 
> 
> Greetings,
> 
> Attached are the updated patches following hch's comments to fix scsi-generic
> device breakage with find_image_format() and refresh_total_sectors().
> 
> These are being resent as the last attachments where in MBOX format from 
> git-format-patch.
> 
> Signed-off-by: Nicholas A. Bellinger 

Thanks, applied all to the block branch, even though I forgot to reply here.

Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Wiki docs on counting and tracing KVM perf events

2010-05-20 Thread Avi Kivity

On 05/20/2010 03:24 PM, Stefan Hajnoczi wrote:

On Thu, May 20, 2010 at 1:14 PM, Avi Kivity  wrote:
   

echo 1>/sys/kernel/debug/tracing/events/kvm/enable
cat /sys/kernel/debug/tracing/trace_pipe>results/trace&
   

perf will enable the events by itself (no?), so all you need is is the perf
call in the middle.
 

Yes, it will enable events.  However, I am also generating a
kvm_trace-like log using trace_pipe.  On this box I couldn't get perf
trace working so I used trace_pipe for the kvm_trace-equivalent and
perf stat for the kvm_stat equivalent.
   


Ah.  There's the new shiny trace_cmd as well.

If you're getting a (binary) trace, runing perf stat is redundant since 
it contains the same data.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Wiki docs on counting and tracing KVM perf events

2010-05-20 Thread Stefan Hajnoczi
On Thu, May 20, 2010 at 1:14 PM, Avi Kivity  wrote:
>> echo 1>/sys/kernel/debug/tracing/events/kvm/enable
>> cat /sys/kernel/debug/tracing/trace_pipe>results/trace&
>
> perf will enable the events by itself (no?), so all you need is is the perf
> call in the middle.

Yes, it will enable events.  However, I am also generating a
kvm_trace-like log using trace_pipe.  On this box I couldn't get perf
trace working so I used trace_pipe for the kvm_trace-equivalent and
perf stat for the kvm_stat equivalent.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Wiki docs on counting and tracing KVM perf events

2010-05-20 Thread Avi Kivity

On 05/20/2010 02:23 PM, Stefan Hajnoczi wrote:

On Thu, May 20, 2010 at 12:16 PM, Jes Sorensen  wrote:
   

On 05/20/10 13:10, Avi Kivity wrote:
 

What's wrong with starting perf after the warm-up period and stopping it
before it's done?
   

It's pretty hard to script.
 

I use the following.  It ain't pretty:

#!/bin/bash
cleanup() {
 trap - 2
 kill -2 $sleep_pid
 echo 0>/sys/kernel/debug/tracing/events/kvm/enable
 kill $cat_pid
}

perf stat -a -e 'kvm:*' sleep 1h>results/perf_stat 2>&1&
sleep_pid=$(sleep 1&&  pgrep -x -f "sleep 1h")   # sleep 1 is to avoid
race with forked perf process
trap cleanup 2
echo 1>/sys/kernel/debug/tracing/events/kvm/enable
cat /sys/kernel/debug/tracing/trace_pipe>results/trace&
cat_pid=$!

# ...do stuff here...

cleanup
   


perf will enable the events by itself (no?), so all you need is is the 
perf call in the middle.


What's missing is vmstat-like or kvm_stat-like output, but that's 
another thing.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH qemu-kvm] device-assignment: add config fd qdev property

2010-05-20 Thread Markus Armbruster
Anthony Liguori  writes:

> On 05/19/2010 02:00 PM, Chris Wright wrote:
>> When libvirt launches a guest it first chowns the relevenat
>> /sys/bus/pci/.../config file for an assigned device then drops privileges.
>>
>> This causes an issue for device assignment because despite being file
>> owner, the sysfs config space file checks for CAP_SYS_ADMIN before
>> allowing access to device dependent config space.
>>
>> This adds a new qdev configfd property which allows libvirt to open the
>> sysfs config space file and give qemu an already opened file descriptor.
>> Along with a change pending for the 2.6.35 kernel, this allows the
>> capability check to compare against privileges from when the file was
>> opened.
>>
>> Signed-off-by: Chris Wright
>>
>
> An fd as a qdev property seems like a bad idea to me.  I'm not sure I
> have a better suggestion though.

Shot from the hip without much thought: could we use monitor command
getfd?  That associates the fd with a name.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Wiki docs on counting and tracing KVM perf events

2010-05-20 Thread Stefan Hajnoczi
On Thu, May 20, 2010 at 12:16 PM, Jes Sorensen  wrote:
> On 05/20/10 13:10, Avi Kivity wrote:
>> What's wrong with starting perf after the warm-up period and stopping it
>> before it's done?
>
> It's pretty hard to script.

I use the following.  It ain't pretty:

#!/bin/bash
cleanup() {
trap - 2
kill -2 $sleep_pid
echo 0 >/sys/kernel/debug/tracing/events/kvm/enable
kill $cat_pid
}

perf stat -a -e 'kvm:*' sleep 1h >results/perf_stat 2>&1 &
sleep_pid=$(sleep 1 && pgrep -x -f "sleep 1h")   # sleep 1 is to avoid
race with forked perf process
trap cleanup 2
echo 1 >/sys/kernel/debug/tracing/events/kvm/enable
cat /sys/kernel/debug/tracing/trace_pipe >results/trace &
cat_pid=$!

# ...do stuff here...

cleanup

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Wiki docs on counting and tracing KVM perf events

2010-05-20 Thread Jes Sorensen
On 05/20/10 13:10, Avi Kivity wrote:
> On 05/20/2010 02:05 PM, Stefan Hajnoczi wrote:
>> Jes, you're right, something like "perf stat -e kvm:* --start" and
>> "perf stat --stop" would be more usable for system-wide monitoring.  I
>> wonder if it is possible to support this or whether the perf process
>> needs to periodically accumulate the counters (i.e. babysit the kernel
>> infrastructure)?
>>
> 
> perf needs to be running to pull data out of the kernel (and since
> profiling is tied to an fd life cycle).
> 
> What's wrong with starting perf after the warm-up period and stopping it
> before it's done?

It's pretty hard to script.

I have scripts that do 'read_stats() ; run_test() ; read_stats() ;
calc_average()'

Autotest would like to do this kinda stuff too.

Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Wiki docs on counting and tracing KVM perf events

2010-05-20 Thread Avi Kivity

On 05/20/2010 02:05 PM, Stefan Hajnoczi wrote:

8330  kvm:kvm_entry#  0.000 M/sec
^--- count since starting perf

The 8330 number means that kvm_entry has fired 8330 times since perf
was started.  Like Avi says, you need to keep the perf process
running.  I run benchmarks using a script that kills perf after the
benchmark completes.

Jes, you're right, something like "perf stat -e kvm:* --start" and
"perf stat --stop" would be more usable for system-wide monitoring.  I
wonder if it is possible to support this or whether the perf process
needs to periodically accumulate the counters (i.e. babysit the kernel
infrastructure)?
   


perf needs to be running to pull data out of the kernel (and since 
profiling is tied to an fd life cycle).


What's wrong with starting perf after the warm-up period and stopping it 
before it's done?


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Wiki docs on counting and tracing KVM perf events

2010-05-20 Thread Stefan Hajnoczi
8330  kvm:kvm_entry#  0.000 M/sec
^--- count since starting perf

The 8330 number means that kvm_entry has fired 8330 times since perf
was started.  Like Avi says, you need to keep the perf process
running.  I run benchmarks using a script that kills perf after the
benchmark completes.

Jes, you're right, something like "perf stat -e kvm:* --start" and
"perf stat --stop" would be more usable for system-wide monitoring.  I
wonder if it is possible to support this or whether the perf process
needs to periodically accumulate the counters (i.e. babysit the kernel
infrastructure)?

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [KVM-AUTOTEST PATCH] KVM test: use command line option wrapper functions

2010-05-20 Thread Lucas Meneghel Rodrigues
On Thu, 2010-05-20 at 12:50 +0300, Michael Goldish wrote:
> On 05/19/2010 11:25 AM, Feng Yang wrote:
> > Hi, Michael
> > 
> > Thanks for your patch.
> > We plan add "netdev" parameter support in make_qemu_command.  Since you are 
> > working on this part. Could you add netdev support in your patch? hopeful 
> > netdev can be default supported in make_qemu_command if qemu support it. 
> > Thanks very much!
> 
> Sure, I'll look into it.
> 
> > I think the point of this patch is good and we need this kinds of patch.
> > But I think we need not add so many new function.  Especially some function 
> > only directly return the string and do nothing more.
> > This will increase the function call consumption.
> > 
> All these helper functions are meant to be extended and modified in the
> future.  They're only there to minimize future effort involved in adding
> support for new command line syntaxes.
> Right now add_smp() just returns " -smp %s", but in the future we may
> have to support different syntaxes for -smp, and then add_smp() will
> consult the output of 'qemu -help' and return the proper string.
> What do you mean by function call consumption?  I don't think these
> functions cause a measurable slowdown, and make_qemu_command() is called
> very few times, so this really isn't a concern IMO.

Agreed, the wrappers are a good strategy in the case we have to support
different feature sets and syntax.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2] correctly trace irq injection on SVM.

2010-05-20 Thread Gleb Natapov
On SVM interrupts are injected by svm_set_irq() not svm_inject_irq().
The later is used only to wait for irq window.

Signed-off-by: Gleb Natapov 

ChangeLog:
 v1->v2:
  - fix stupid cut&paste error.

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 58c91f5..69b16a7 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2831,8 +2831,6 @@ static inline void svm_inject_irq(struct vcpu_svm *svm, 
int irq)
 {
struct vmcb_control_area *control;
 
-   trace_kvm_inj_virq(irq);
-
++svm->vcpu.stat.irq_injections;
control = &svm->vmcb->control;
control->int_vector = irq;
@@ -2847,6 +2845,8 @@ static void svm_set_irq(struct kvm_vcpu *vcpu)
 
BUG_ON(!(gif_set(svm)));
 
+   trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
+
svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
 }
--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] correctly trace irq injection on SVM.

2010-05-20 Thread Gleb Natapov
On Thu, May 20, 2010 at 01:19:45PM +0300, Gleb Natapov wrote:
> On SVM interrupts are injected by svm_set_irq() not svm_inject_irq().
> The later is used only to wait for irq window.
> 
And of course ignore this. Haven't noticed compilation error.

> Signed-off-by: Gleb Natapov 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 58c91f5..cd70109 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2831,8 +2831,6 @@ static inline void svm_inject_irq(struct vcpu_svm *svm, 
> int irq)
>  {
>   struct vmcb_control_area *control;
>  
> - trace_kvm_inj_virq(irq);
> -
>   ++svm->vcpu.stat.irq_injections;
>   control = &svm->vmcb->control;
>   control->int_vector = irq;
> @@ -2847,6 +2845,8 @@ static void svm_set_irq(struct kvm_vcpu *vcpu)
>  
>   BUG_ON(!(gif_set(svm)));
>  
> + trace_kvm_inj_virq(irq);
> +
>   svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
>   SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
>  }
> --
>   Gleb.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] correctly trace irq injection on SVM.

2010-05-20 Thread Gleb Natapov
On SVM interrupts are injected by svm_set_irq() not svm_inject_irq().
The later is used only to wait for irq window.

Signed-off-by: Gleb Natapov 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 58c91f5..cd70109 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2831,8 +2831,6 @@ static inline void svm_inject_irq(struct vcpu_svm *svm, 
int irq)
 {
struct vmcb_control_area *control;
 
-   trace_kvm_inj_virq(irq);
-
++svm->vcpu.stat.irq_injections;
control = &svm->vmcb->control;
control->int_vector = irq;
@@ -2847,6 +2845,8 @@ static void svm_set_irq(struct kvm_vcpu *vcpu)
 
BUG_ON(!(gif_set(svm)));
 
+   trace_kvm_inj_virq(irq);
+
svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
 }
--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix "info cpus" halted state display

2010-05-20 Thread Gleb Natapov
On Thu, May 13, 2010 at 04:17:14PM +0300, Gleb Natapov wrote:
> When in-kernel irqchip is used env->halted is never used for anything
> except "info cpus" command. Halted state is synced in
> kvm_arch_save_mpstate() and showed by do_info_cpus() but otherwise never
> looked at. Zeroing it here breaks "info cpus" since before
> do_info_cpus() outputs env->halted in io thread it is zeroed here when
> vcpu thread reenters kernel.
> 
Avi, what about this patch?

> Signed-off-by: Gleb Natapov 
> diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
> index 61d9331..0ec2881 100644
> --- a/qemu-kvm-x86.c
> +++ b/qemu-kvm-x86.c
> @@ -922,10 +922,6 @@ void kvm_arch_load_regs(CPUState *env, int level)
>  if (env->kvm_vcpu_update_vapic)
>  kvm_tpr_enable_vapic(env);
>  }
> -if (kvm_irqchip_in_kernel()) {
> -/* Avoid deadlock: no user space IRQ will ever clear it. */
> -env->halted = 0;
> -}
>  
>  kvm_put_vcpu_events(env, level);
>  kvm_put_debugregs(env);
> --
>   Gleb.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Michael S. Tsirkin
On Thu, May 20, 2010 at 02:31:50PM +0930, Rusty Russell wrote:
> Can we do better?  The obvious idea is to try to get rid of last_used and
> used, and use the ring itself.  We would use an invalid entry to mark the
> head of the ring.
> 
> Any other thoughts?
> Rusty.

We also need a way to avoid interrupts at least while we are
processing the ring.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH qemu-kvm] device-assignment: add config fd qdev property

2010-05-20 Thread Daniel P. Berrange
On Wed, May 19, 2010 at 02:18:35PM -0500, Anthony Liguori wrote:
> On 05/19/2010 02:00 PM, Chris Wright wrote:
> >When libvirt launches a guest it first chowns the relevenat
> >/sys/bus/pci/.../config file for an assigned device then drops privileges.
> >
> >This causes an issue for device assignment because despite being file
> >owner, the sysfs config space file checks for CAP_SYS_ADMIN before
> >allowing access to device dependent config space.
> >
> >This adds a new qdev configfd property which allows libvirt to open the
> >sysfs config space file and give qemu an already opened file descriptor.
> >Along with a change pending for the 2.6.35 kernel, this allows the
> >capability check to compare against privileges from when the file was
> >opened.
> >
> >Signed-off-by: Chris Wright
> >   
> 
> An fd as a qdev property seems like a bad idea to me.  I'm not sure I 
> have a better suggestion though.

The entire requirement to pass the open FD to qemu is a bad idea, but
the kernel makes it impossible to do otherwise :-( Personally I wish the
kernel just used the file ownership, so we could chown() the sysfs file 
to 'qemu' user in a normal manner :-(

Daniel
-- 
|: Red Hat, Engineering, London-o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org-o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Autotest] [v3 PATCH] KVM test: Add a helper to search the panic in the log

2010-05-20 Thread Michael Goldish
On 05/19/2010 12:13 PM, Jason Wang wrote:
> This checker serves as the post_command to find the panic information
> in the file which contains the content of guest serial console.
> 
> Changes from v2:
> - Put all things into __main__
> - Fix some typos
> 
> Signed-off-by: Jason Wang 
> ---
>  client/tests/kvm/scripts/check_serial.py |   24 
>  client/tests/kvm/tests_base.cfg.sample   |7 +--
>  2 files changed, 29 insertions(+), 2 deletions(-)
>  create mode 100644 client/tests/kvm/scripts/check_serial.py
> 
> diff --git a/client/tests/kvm/scripts/check_serial.py 
> b/client/tests/kvm/scripts/check_serial.py
> new file mode 100644
> index 000..6432c27
> --- /dev/null
> +++ b/client/tests/kvm/scripts/check_serial.py
> @@ -0,0 +1,24 @@
> +import os, sys, glob, re
> +
> +
> +class SerialCheckerError(Exception):
> +"""
> +Simple wrapper for the builtin Exception class.
> +"""
> +pass
> +
> +
> +if __name__ == "__main__":
> +client_dir =  os.environ['AUTODIR']
> +pattern = os.environ['KVM_TEST_search_pattern']
> +shortname = os.environ['KVM_TEST_shortname']
> +debugdir = os.path.join(client_dir, "results/default/kvm.%s/debug" 
> +% shortname)
> +serial_files = glob.glob(os.path.join(debugdir, 'serial*'))
> +
> +fail = [ f for f in serial_files if
> + re.findall(pattern, file(f).read(), re.I) ]
> +if fail:
> +print "%s is found in %s" % (pattern, fail)
> +raise SerialCheckerError("Error found during the check, please " 
> + "check the log")
> diff --git a/client/tests/kvm/tests_base.cfg.sample 
> b/client/tests/kvm/tests_base.cfg.sample
> index e85bb4a..c4e522a 100644
> --- a/client/tests/kvm/tests_base.cfg.sample
> +++ b/client/tests/kvm/tests_base.cfg.sample
> @@ -52,6 +52,10 @@ address_index = 0
>  # Misc
>  profilers = kvm_stat
>  
> +# pattern to search in guest serial console
> +search_pattern = panic
> +post_command = "python scripts/check_serial.py"
> +post_command_noncritical = no
>  
>  # Tests
>  variants:
> @@ -1324,10 +1328,9 @@ virtio|virtio_blk|e1000|balloon_check:
>  variants:
>  - @qcow2:
>  image_format = qcow2
> -post_command = " python scripts/check_image.py;"
> +post_command += " && python scripts/check_image.py"

If post_command is empty before executing this line, it'll end up being
" && python scripts/..." and bash doesn't like that AFAIK.  I think this
line should be:

post_command += " python scripts/check_image.py;"

So if post_command is empty, it becomes " python ...;" which is OK, and
if it isn't empty, it becomes "previous command; python ...;" (assuming
the previous command ended with ';').

>  remove_image = no
>  post_command_timeout = 600
> -post_command_noncritical = yes
>  - vmdk:
>  only Fedora Ubuntu Windows
>  only smp2
> 
> ___
> Autotest mailing list
> autot...@test.kernel.org
> http://test.kernel.org/cgi-bin/mailman/listinfo/autotest

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [KVM-AUTOTEST PATCH] KVM test: use command line option wrapper functions

2010-05-20 Thread Michael Goldish
On 05/19/2010 11:25 AM, Feng Yang wrote:
> Hi, Michael
> 
> Thanks for your patch.
> We plan add "netdev" parameter support in make_qemu_command.  Since you are 
> working on this part. Could you add netdev support in your patch? hopeful 
> netdev can be default supported in make_qemu_command if qemu support it. 
> Thanks very much!

Sure, I'll look into it.

> I think the point of this patch is good and we need this kinds of patch.
> But I think we need not add so many new function.  Especially some function 
> only directly return the string and do nothing more.
> This will increase the function call consumption.
> 
> 
> - "Michael Goldish"  wrote:
> 
>> From: "Michael Goldish" 
>> To: autot...@test.kernel.org, kvm@vger.kernel.org
>> Cc: "Michael Goldish" 
>> Sent: Monday, May 17, 2010 9:29:35 PM GMT +08:00 Beijing / Chongqing / Hong 
>> Kong / Urumqi
>> Subject: [KVM-AUTOTEST PATCH] KVM test: use command line option wrapper 
>> functions
>>
>> In order to support multiple versions of qemu which use different
>> command line
>> options or syntaxes, wrap all command line options in small helper
>> functions,
>> which append text to the command line according to the output of 'qemu
>> -help'.
>>
>> Signed-off-by: Michael Goldish 
>> ---
>>  client/tests/kvm/kvm_vm.py |  198
>> ++--
>>  1 files changed, 135 insertions(+), 63 deletions(-)
>>
>> diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py
>> index 047505a..94bacdf 100755
>> --- a/client/tests/kvm/kvm_vm.py
>> +++ b/client/tests/kvm/kvm_vm.py
>> @@ -186,12 +186,100 @@ class VM:
>> nic_model -- string to pass as 'model' parameter for
>> this
>> NIC (e.g. e1000)
>>  """
>> -if name is None:
>> -name = self.name
>> -if params is None:
>> -params = self.params
>> -if root_dir is None:
>> -root_dir = self.root_dir
>> +# Helper function for command line option wrappers
>> +def has_option(help, option):
>> +return bool(re.search(r"^-%s(\s|$)" % option, help,
>> re.MULTILINE))
>> +
>> +# Wrappers for all supported qemu command line parameters.
>> +# This is meant to allow support for multiple qemu versions.
>> +# Each of these functions receives the output of 'qemu -help'
>> as a
>> +# parameter, and should add the requested command line
>> option
>> +# accordingly.
>> +
>> +def add_name(help, name):
>> +return " -name '%s'" % name
> 
> I think we need not add so many new function.  Especially some function only 
> directly return the string and do nothing more.
> This will increase the function call consumption.
> 
>> +
>> +def add_unix_socket_monitor(help, filename):
>> +return " -monitor unix:%s,server,nowait" % filename
> Same as above
>> +
>> +def add_mem(help, mem):
>> +return " -m %s" % mem
> Same as above
>> +
>> +def add_smp(help, smp):
>> +return " -smp %s" % smp
> Same as above.

All these helper functions are meant to be extended and modified in the
future.  They're only there to minimize future effort involved in adding
support for new command line syntaxes.
Right now add_smp() just returns " -smp %s", but in the future we may
have to support different syntaxes for -smp, and then add_smp() will
consult the output of 'qemu -help' and return the proper string.
What do you mean by function call consumption?  I don't think these
functions cause a measurable slowdown, and make_qemu_command() is called
very few times, so this really isn't a concern IMO.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][v3] KVM: VMX: Enable XSAVE/XRSTORE for guest

2010-05-20 Thread Avi Kivity

On 05/20/2010 12:16 PM, Sheng Yang wrote:

From: Dexuan Cui

Enable XSAVE/XRSTORE for guest.

Change from V2:
Addressed comments from Avi.

Change from V1:

1. Use FPU API.
2. Fix CPUID issue.
3. Save/restore all possible guest xstate fields when switching. Because we
don't know which fields guest has already touched.


diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d08bb4a..3938bd1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -302,6 +302,7 @@ struct kvm_vcpu_arch {
} update_pte;

struct fpu guest_fpu;
+   u64 xcr0;

gva_t mmio_fault_cr2;
struct kvm_pio_request pio;
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 9e6779f..346ea66 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -266,6 +266,7 @@ enum vmcs_field {
  #define EXIT_REASON_EPT_VIOLATION   48
  #define EXIT_REASON_EPT_MISCONFIG   49
  #define EXIT_REASON_WBINVD54
+#define EXIT_REASON_XSETBV 55

  /*
   * Interruption-information format
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 99ae513..a63f206 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -36,6 +36,8 @@
  #include
  #include
  #include
+#include
+#include

  #include "trace.h"

@@ -247,6 +249,9 @@ static const u32 vmx_msr_index[] = {
  };
  #define NR_VMX_MSR ARRAY_SIZE(vmx_msr_index)

+#define MERGE_TO_U64(low, high) \
+   (((low)&  -1u) | ((u64)((high)&  -1u)<<  32))
+
   


static inline u64 kvm_read_edx_eax(vcpu) in cache_regs.h



+static int handle_xsetbv(struct kvm_vcpu *vcpu)
+{
+   u64 new_bv = MERGE_TO_U64(kvm_register_read(vcpu, VCPU_REGS_RAX),
+   kvm_register_read(vcpu, VCPU_REGS_RDX));
+
+   if (kvm_register_read(vcpu, VCPU_REGS_RCX) != 0)
+   goto err;
+   if (vmx_get_cpl(vcpu) != 0)
+   goto err;
+   if (!(new_bv&  XSTATE_FP))
+   goto err;
+   if ((new_bv&  XSTATE_YMM)&&  !(new_bv&  XSTATE_SSE))
+   goto err;
   


What about a check against unknown bits?


+   vcpu->arch.xcr0 = new_bv;
+   xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
+   skip_emulated_instruction(vcpu);
+   return 1;
+err:
+   kvm_inject_gp(vcpu, 0);
+   return 1;
+}
+
  static int handle_apic_access(struct kvm_vcpu *vcpu)
  {
return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;


+static u64 host_xcr0;
   


__read_mostly.


+
+static void update_cpuid(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 1, 0);
+   if (!best)
+   return;
+
+   /* Update OSXSAVE bit */
+   if (cpu_has_xsave&&  best->function == 0x1) {
+   best->ecx&= ~(bit(X86_FEATURE_OSXSAVE));
+   if (kvm_read_cr4(vcpu)&  X86_CR4_OSXSAVE)
+   best->ecx |= bit(X86_FEATURE_OSXSAVE);
+   }
+}
   


Note: need to update after userspace writes cpuid as well.


+
  int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
  {
unsigned long old_cr4 = kvm_read_cr4(vcpu);
@@ -481,6 +513,9 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
if (cr4&  CR4_RESERVED_BITS)
return 1;

+   if (!guest_cpuid_has_xsave(vcpu)&&  (cr4&  X86_CR4_OSXSAVE))
+   return 1;
+
if (is_long_mode(vcpu)) {
if (!(cr4&  X86_CR4_PAE))
return 1;
@@ -497,6 +532,9 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
if ((cr4 ^ old_cr4)&  pdptr_bits)
kvm_mmu_reset_context(vcpu);

+   if ((cr4 ^ old_cr4)&  X86_CR4_OSXSAVE)
+   update_cpuid(vcpu);
+
   


I think we need to reload the guest's xcr0 at this point.  
Alternatively, call vmx_load_host_state() to ensure the the next entry 
will reload it.



@@ -1931,7 +1964,7 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
u32 function,

switch (function) {
case 0:
-   entry->eax = min(entry->eax, (u32)0xb);
+   entry->eax = min(entry->eax, (u32)0xd);
   


Do we need any special handling for leaf 0xc?


@@ -4567,6 +4616,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_x86_ops->prepare_guest_switch(vcpu);
if (vcpu->fpu_active)
kvm_load_guest_fpu(vcpu);
+   if (kvm_read_cr4(vcpu)&  X86_CR4_OSXSAVE)
+   xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
   


Better done in vmx_save_host_state(), so we only do it on context 
switches or entries from userspace.


kvm_read_cr4_bits() is faster - doesn't need a vmcs_readl().



atomic_set(&vcpu->guest_mode, 1);
smp_wmb();
@@ -5118,6 +5169,10 @@ void fx_init(struct kvm_vcpu *vcpu)
fpu_alloc(&vcpu->arch.guest_fpu);
fpu_finit(&vcpu->arch.guest_fpu);

+   /* Ensure guest xcr0 is valid for loading */
+ 

Re: [Autotest] [PATCH] KVM Test: Make remote_scp() more robust.

2010-05-20 Thread Michael Goldish
On 05/19/2010 05:14 AM, Feng Yang wrote:
> 
> - "Michael Goldish"  wrote:
> 
>> From: "Michael Goldish" 
>> To: "Feng Yang" 
>> Cc: autot...@test.kernel.org, kvm@vger.kernel.org
>> Sent: Monday, May 17, 2010 11:05:37 PM GMT +08:00 Beijing / Chongqing / Hong 
>> Kong / Urumqi
>> Subject: Re: [Autotest] [PATCH] KVM Test: Make remote_scp() more robust.
>>
>> On 05/07/2010 01:26 PM, Feng Yang wrote:
>>> 1. In remote_scp(), if SCP connetion stalled for some reason,
>> following
>>> code will be ran.
>>> else:  # match == None
>>>
>>> logging.debug("Timeout elapsed or process terminated")
>>> status = sub.get_status()
>>> sub.close()
>>> return status == 0
>>> At this moment, kvm_subprocess server is still running which means
>>> lock_server_running_filename is still locked. But sub.get_status()
>>> tries to lock it again.  If kvm_subprocess server keeps running,
>>> a deadlock will happen. This patch will fix this issue by enable
>>
>> Agreed.  It's a mistake (my mistake) to call get_status() on a
>> process
>> that's still running and isn't expected to terminate soon.  I think
>> even
>> the docstring of get_status() says that it blocks, so that's expected
>> behavior.
>> However, there's a simple solution to that, and I don't see why an
>> additional timeout is necessary.
>>
>>> timeout parameter. Update default value for timeout to 600, it
>> should
>>> be enough.
>>>
>>> 2. Add "-v" in scp command to catch more infomation. Also add "Exit
>> status"
>>> and "stalled" match prompt in remote_scp().
>>> Signed-off-by: Feng Yang 
>>> ---
>>>  client/tests/kvm/kvm_utils.py |   36
>> 
>>>  client/tests/kvm/kvm_vm.py|4 ++--
>>>  2 files changed, 30 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/client/tests/kvm/kvm_utils.py
>> b/client/tests/kvm/kvm_utils.py
>>> index 25f3c8c..3db4dec 100644
>>> --- a/client/tests/kvm/kvm_utils.py
>>> +++ b/client/tests/kvm/kvm_utils.py
>>> @@ -524,7 +524,7 @@ def remote_login(command, password, prompt,
>> linesep="\n", timeout=10):
>>>  return None
>>>  
>>>  
>>> -def remote_scp(command, password, timeout=300, login_timeout=10):
>>> +def remote_scp(command, password, timeout=600, login_timeout=10):
>>>  """
>>>  Run the given command using kvm_spawn and provide answers to
>> the questions
>>>  asked. If timeout expires while waiting for the transfer to
>> complete ,
>>> @@ -548,12 +548,18 @@ def remote_scp(command, password, timeout=300,
>> login_timeout=10):
>>>  
>>>  password_prompt_count = 0
>>>  _timeout = login_timeout
>>> +end_time = time.time() + timeout
>>> +logging.debug("Trying to SCP...")
>>>  
>>> -logging.debug("Trying to login...")
>>>  
>>>  while True:
>>> +if end_time <= time.time():
>>> +logging.debug("transfer timeout!")
>>> +sub.close()
>>> +return False
>>>  (match, text) = sub.read_until_last_line_matches(
>>> -[r"[Aa]re you sure", r"[Pp]assword:\s*$", r"lost
>> connection"],
>>> +[r"[Aa]re you sure", r"[Pp]assword:\s*$", r"lost
>> connection",
>>> + r"Exit status", r"stalled"],
>>>  timeout=_timeout, internal_timeout=0.5)
>>>  if match == 0:  # "Are you sure you want to continue
>> connecting"
>>>  logging.debug("Got 'Are you sure...'; sending 'yes'")
>>> @@ -574,15 +580,29 @@ def remote_scp(command, password, timeout=300,
>> login_timeout=10):
>>>  logging.debug("Got 'lost connection'")
>>>  sub.close()
>>>  return False
>>> +elif match == 3: # "Exit status"
>>
>> This check for "Exit status" is redundant.  When the process
>> terminates,
>> read_until_last_line_matches() will return None and get_status() will
>> return the exit status.
> Here check for "Exit status", we can get not only the exit status,but also 
> some useful debug information when exit status is not 0.
> Because we have enable '-v' in scp command.
> 
> but read_until_last_line_matches() only return exit status.

You get the same information from read_until_last_line_matches().  It
returns (match, text).  If match is None, and sub.get_status() != 0,
then something bad happened, and then we can print text.

>>
>>> +sub.close()
>>> +if "Exit status 0" in text:
>>> +logging.debug("SCP command completed
>> successfully")
>>> +return True
>>> +else:
>>> +logging.debug("SCP command fail with exit status
>> %s" % text)
>>> +return False
>>> +elif match == 4: # "stalled"
>>> +logging.debug("SCP connection stalled for some
>> reason")
>>> +continue
>>> +
>>>  else:  # match == None
>>> -logging.debug("Timeout elapsed or process terminated")
>>> +if sub.is_alive():
>>> +continue
>>> +  

[PATCH][v3] KVM: VMX: Enable XSAVE/XRSTORE for guest

2010-05-20 Thread Sheng Yang
From: Dexuan Cui 

Enable XSAVE/XRSTORE for guest.

Change from V2:
Addressed comments from Avi.

Change from V1:

1. Use FPU API.
2. Fix CPUID issue.
3. Save/restore all possible guest xstate fields when switching. Because we
don't know which fields guest has already touched.

Signed-off-by: Dexuan Cui 
Signed-off-by: Sheng Yang 
---

Avi, could you help to review this kernel patch first? Testcase and LM are in
progress now.

 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/include/asm/vmx.h  |1 +
 arch/x86/kvm/vmx.c  |   28 
 arch/x86/kvm/x86.c  |   88 +++---
 4 files changed, 111 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d08bb4a..3938bd1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -302,6 +302,7 @@ struct kvm_vcpu_arch {
} update_pte;
 
struct fpu guest_fpu;
+   u64 xcr0;
 
gva_t mmio_fault_cr2;
struct kvm_pio_request pio;
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 9e6779f..346ea66 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -266,6 +266,7 @@ enum vmcs_field {
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_WBINVD 54
+#define EXIT_REASON_XSETBV 55
 
 /*
  * Interruption-information format
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 99ae513..a63f206 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -36,6 +36,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "trace.h"
 
@@ -247,6 +249,9 @@ static const u32 vmx_msr_index[] = {
 };
 #define NR_VMX_MSR ARRAY_SIZE(vmx_msr_index)
 
+#define MERGE_TO_U64(low, high) \
+   (((low) & -1u) | ((u64)((high) & -1u) << 32))
+
 static inline bool is_page_fault(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK |
@@ -3354,6 +3359,28 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static int handle_xsetbv(struct kvm_vcpu *vcpu)
+{
+   u64 new_bv = MERGE_TO_U64(kvm_register_read(vcpu, VCPU_REGS_RAX),
+   kvm_register_read(vcpu, VCPU_REGS_RDX));
+
+   if (kvm_register_read(vcpu, VCPU_REGS_RCX) != 0)
+   goto err;
+   if (vmx_get_cpl(vcpu) != 0)
+   goto err;
+   if (!(new_bv & XSTATE_FP))
+   goto err;
+   if ((new_bv & XSTATE_YMM) && !(new_bv & XSTATE_SSE))
+   goto err;
+   vcpu->arch.xcr0 = new_bv;
+   xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
+   skip_emulated_instruction(vcpu);
+   return 1;
+err:
+   kvm_inject_gp(vcpu, 0);
+   return 1;
+}
+
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
@@ -3632,6 +3659,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu 
*vcpu) = {
[EXIT_REASON_TPR_BELOW_THRESHOLD] = handle_tpr_below_threshold,
[EXIT_REASON_APIC_ACCESS] = handle_apic_access,
[EXIT_REASON_WBINVD]  = handle_wbinvd,
+   [EXIT_REASON_XSETBV]  = handle_xsetbv,
[EXIT_REASON_TASK_SWITCH] = handle_task_switch,
[EXIT_REASON_MCE_DURING_VMENTRY]  = handle_machine_check,
[EXIT_REASON_EPT_VIOLATION]   = handle_ept_violation,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7be1d36..7580d14 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -64,6 +64,7 @@
(~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\
  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \
  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR  \
+ | X86_CR4_OSXSAVE \
  | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
@@ -149,6 +150,13 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ NULL }
 };
 
+static u64 host_xcr0;
+
+static inline u32 bit(int bitno)
+{
+   return 1 << (bitno & 31);
+}
+
 static void kvm_on_user_return(struct user_return_notifier *urn)
 {
unsigned slot;
@@ -473,6 +481,30 @@ void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw)
 }
 EXPORT_SYMBOL_GPL(kvm_lmsw);
 
+static bool guest_cpuid_has_xsave(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 1, 0);
+   return best && (best->ecx & bit(X86_FEATURE_XSAVE));
+}
+
+static void update_cpuid(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 1, 0);
+   if (!best)
+   return;
+
+   /* Update OSXSAVE bit */
+   if (cpu_has_xsave && best->function == 0x1) {
+   best->ecx 

Re: Wiki docs on counting and tracing KVM perf events

2010-05-20 Thread Jes Sorensen
On 05/20/10 10:19, Avi Kivity wrote:
> On 05/20/2010 11:15 AM, Jes Sorensen wrote:   
>>> Two things are missing to make this really useful:
>>>
>>> - a continuously updating difference mode like kvm_stat
>>> - subevents; for example kvm:kvm_exit is an aggregate of all exit types
>>> that can be split using filters to show individual exit reason
>>> statistics
>>>  
>> Third missing item, which I find really useful:
>> - run once spit out raw counters
>>
>> For some operations, like file system benchmarking, it is useful to
>> sample the counters before and after and then divide the raw number of
>> events by the number of IOPS performed by the benchmark. If perf spits
>> out events/sec it's kinda hard to get this.
>>
> 
> That's 'perf stat -a sleep 2'

sleep 2 doesn't really cut it, I guess you could do it with

perf stat -a sleep 0.1

but to be honest, that is pretty ugly. Something like this would be a
lot nicer:

perf stat -a -raw

Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Wiki docs on counting and tracing KVM perf events

2010-05-20 Thread Avi Kivity

On 05/20/2010 11:15 AM, Jes Sorensen wrote:

On 05/13/10 16:57, Avi Kivity wrote:
   

On 05/13/2010 05:35 PM, Stefan Hajnoczi wrote:
 

How to count and trace KVM perf events:

http://www.linux-kvm.org/page/Perf_events

I want to draw attention to this because traditional kvm_stat and
kvm_trace use has been moving over to the debugfs based tracing
mechanisms.  Perhaps we can flesh out documentation and examples of
common perf event usage.


   

Two things are missing to make this really useful:

- a continuously updating difference mode like kvm_stat
- subevents; for example kvm:kvm_exit is an aggregate of all exit types
that can be split using filters to show individual exit reason statistics
 

Third missing item, which I find really useful:
- run once spit out raw counters

For some operations, like file system benchmarking, it is useful to
sample the counters before and after and then divide the raw number of
events by the number of IOPS performed by the benchmark. If perf spits
out events/sec it's kinda hard to get this.
   


That's 'perf stat -a sleep 2'

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Wiki docs on counting and tracing KVM perf events

2010-05-20 Thread Jes Sorensen
On 05/13/10 16:57, Avi Kivity wrote:
> On 05/13/2010 05:35 PM, Stefan Hajnoczi wrote:
>> How to count and trace KVM perf events:
>>
>> http://www.linux-kvm.org/page/Perf_events
>>
>> I want to draw attention to this because traditional kvm_stat and
>> kvm_trace use has been moving over to the debugfs based tracing
>> mechanisms.  Perhaps we can flesh out documentation and examples of
>> common perf event usage.
>>
>>
> 
> Two things are missing to make this really useful:
> 
> - a continuously updating difference mode like kvm_stat
> - subevents; for example kvm:kvm_exit is an aggregate of all exit types
> that can be split using filters to show individual exit reason statistics

Third missing item, which I find really useful:
- run once spit out raw counters

For some operations, like file system benchmarking, it is useful to
sample the counters before and after and then divide the raw number of
events by the number of IOPS performed by the benchmark. If perf spits
out events/sec it's kinda hard to get this.

Cheers,
Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] pc: fix segfault introduced by 3d53f5c36ff6

2010-05-20 Thread Isaku Yamahata
Thank you for fixing it. Probably I was too in hurry when rebasing the patches.

Acked-by: Isaku Yamahata 

On Thu, May 20, 2010 at 09:14:04AM +0300, Eduard - Gabriel Munteanu wrote:
> Commit 3d53f5c36ff6 introduced a segfault by erroneously making fw_cfg a
> 'void **' and passing it around in different ways.
> 
> Signed-off-by: Eduard - Gabriel Munteanu 
> ---
>  hw/pc.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/pc.c b/hw/pc.c
> index fee08c9..4a4a706 100644
> --- a/hw/pc.c
> +++ b/hw/pc.c
> @@ -822,7 +822,7 @@ void pc_memory_init(ram_addr_t ram_size,
>  ram_addr_t ram_addr, bios_offset, option_rom_offset;
>  ram_addr_t below_4g_mem_size, above_4g_mem_size = 0;
>  int bios_size, isa_bios_size;
> -void **fw_cfg;
> +void *fw_cfg;
>  
>  if (ram_size >= 0xe000 ) {
>  above_4g_mem_size = ram_size - 0xe000;
> @@ -905,7 +905,7 @@ void pc_memory_init(ram_addr_t ram_size,
>  rom_set_fw(fw_cfg);
>  
>  if (linux_boot) {
> -load_linux(*fw_cfg, kernel_filename, initrd_filename, 
> kernel_cmdline, below_4g_mem_size);
> +load_linux(fw_cfg, kernel_filename, initrd_filename, kernel_cmdline, 
> below_4g_mem_size);
>  }
>  
>  for (i = 0; i < nb_option_roms; i++) {
> -- 
> 1.6.4.4
> 
> 

-- 
yamahata
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC] Bug Day - June 1st, 2010

2010-05-20 Thread Jes Sorensen
On 05/19/10 15:34, Anthony Liguori wrote:
> On 05/19/2010 12:04 AM, Aurelien Jarno wrote:
>> The idea is nice, but would it be possible to hold this on a week-end,
>> I personally won't be able to attend such thing on a day week.
>>
>> Or maybe holding that on two days: friday and saturday so that people
>> can participate at least one of the two days, depending if they do that
>> from work or from home.
> 
> The work week in Israel is Sunday - Thursday.
> 
> It would have to be Sunday and Monday but honestly, I think both days
> tend to be bad for this sort of thing.
> 
> I'd much rather do more frequent bug days and alternate between a
> weekday and a Saturday.

If we settle for the 2nd of June, maybe the people who are unavailable
on week days, could run a pre-bug day on Sunday the 30th. Maybe some of
us would be able to stop by the channel briefly on the Sunday even if we
plan to do the big bug day on the 2nd?

Cheers,
Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC] Bug Day - June 1st, 2010

2010-05-20 Thread Jes Sorensen
On 05/19/10 02:58, Natalia Portillo wrote:
> Hi,
> 
>> - We'll try to migrate as many confirmable bugs from the Source Forge 
>> tracker to Launchpad.
> I think that part of the bug day should also include retesting OSes that 
> appear in OS Support List as having bug and confirming if the bug is still 
> present and if it's in Launchpad or not.

This would be a great task for people who would like to contribute, but
maybe don't feel they have the experience or knowledge to hack on the
code itself.

Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC] Bug Day - June 1st, 2010

2010-05-20 Thread Andre Przywara

Michael Tokarev wrote:

20.05.2010 02:30, Anthony Liguori wrote:

On 05/19/2010 05:29 PM, Andre Przywara wrote:

Michael Tokarev wrote:

...

Also, thanks to Andre Przywara, whole winNT thing works but it requires
-cpu qemu64,level=1 (or level=2 or =3), -- _not_ with default CPU. This

[]

It'd be nice if we had more flexibility in defining custom machine types
so you could just do qemu -M win98.


This is wrong IMHO.  win98 and winNT can run on various different
machines, including all modern ones (yes I tried the same winNT
on my Athlon X2-64, just had to switch SATA from AHCI to IDE;
win95 works too)...  just not in kvm :)
Well, not really. You were lucky with your Athlon X2-64, actually it is 
the last machine not triggering the bug. I tried it on a AthlonII-X4 
(which has maxleaf=5 as any newer AMD machines) and it showed the same 
bug. On Intel boxes this bug should trigger on every CPU starting with 
some Pentium4 models, including all Core chips.

Have you tried versions with a newer service pack (SP6)?


BTW: Does anyone knows what the problem with Windows95/98 on KVM is? I
tried some tracing today, but couldn't find a hint.


Um.  The bugreport(s) come as a surprize for me: I tried to
install win98 in kvm several times in the past but setup
always failed - different messages in different versions
of kvm, either "unable to emulate" or "real mode trap" or
something else, or just lockup, usually on first reboot.
So - the bugreports talks about mouse non-working, but
this means win98 itself works somehow...  I dunno :)
I think these bug reports are about plain QEMU. I tried it yesterday, in 
fact the mouse is non-functional. In KVM Windows95 gives me a black 
screen after the welcome screen with the moving bottom row. There are 
just two lines at the top: (translated from the german version)

While initializing device NTKERN:
Windows protection fault. Restart the computer.

KVM catched some #UDs due to ARPL from VM86 mode, but TCG got them too 
and it survived. So if anyone has some more hints, I'd be grateful.


Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


computer frozen

2010-05-20 Thread magicboiz
Hello

since kernel 2.6.28 or 2.6.29, I don't remember exactly, whenever I try to run 
KVM in my laptop, I get my computer totally frozen.

I'd try:
 - "-no-kvm" flag: works, but very slow
 - "-cpu qemu32,-nx": frozen
 - "-no-acpi" flag: frozen

I'd try with several kernels (ubuntu and openssuse kernels), also with custom 
kernels compiled by me (with the minimal options enabled)but always the 
same result: computer frozen

An interesting point: with Sun VirtualBox 3.1, the same frozen result.

My laptop is a TOSHIBA TECRA S4 (europe model only).

magicb...@linux-ue9l:~/> cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Core(TM)2 CPU T7200  @ 2.00GHz
stepping: 6
cpu MHz : 2000.000
cache size  : 4096 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 2
apicid  : 0
initial apicid  : 0
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm 
constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips: 3989.99
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Core(TM)2 CPU T7200  @ 2.00GHz
stepping: 6
cpu MHz : 2000.000
cache size  : 4096 KB
physical id : 0
siblings: 2
core id : 1
cpu cores   : 2
apicid  : 1
initial apicid  : 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm 
constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips: 3995.44
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:







Anyone can help me?

Thx in advance.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/12] kvm: enable smp > 1

2010-05-20 Thread Udo Lembke

Avi Kivity schrieb:

On 05/19/2010 11:02 PM, Udo Lembke wrote:

Unrelated, what are your smp issues?



If i use one cpu i got a good io-performance:
e.g. over 500MB/s at the profile "install" of the io-benchmark 
h2benchw.exe.
( aio=threads | SAS-Raid-0 | 
ftp://ftp.heise.de/pub/ct/ctsi/h2benchw.zip | hwbenchw.exe -p -w 
iotest 0)

The same test but with two cpus gives results between 27 and 298 MB/s!

Also in real life it's noticeable not only with an benchmark. I use a 
win-vm with two cpu for postscript-ripping and have a performance 
drop due to the bad io.



Hi,

What's your block device model?  virtio or ide?
in the test described before i used virtio, but the same happens with 
ide (but of course slightly different values).


What does cpu usage look like on guest or host?
On the guest it's looks like the io-process flap between the cpus. 
Windows show both cpus together are around 65% (less or more) , but if 
one CPU-usage rise, the other drop.

On the host:
 PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
5386 root  20   0 1160m 1.0g 1552 R  109 13.5   1:23.58 kvm

The guest is a win-xp, but the same happens in real life on a win2003.

Udo



smime.p7s
Description: S/MIME Cryptographic Signature


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Avi Kivity

On 05/20/2010 08:01 AM, Rusty Russell wrote:



A device with out of order
completion (like virtio-blk) will quickly randomize the unused
descriptor indexes, so every descriptor fetch will require a bounce.

In contrast, if the rings hold the descriptors themselves instead of
pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
every descriptor, amortized.
 

We already have indirect, this would be a logical next step.  So let's
think about it. The avail ring would contain 64 bit values, the used ring
would contain indexes into the avail ring.
   


Have just one ring, no indexes.  The producer places descriptors into 
the ring and updates the head,  The consumer copies out descriptors to 
be processed and copies back in completed descriptors.  Chaining is 
always linear.  The descriptors contain a tag that allow the producer to 
identify the completion.


Indirect only pays when there are enough descriptors in it to fill a 
couple of cache lines.  Otherwise it's an extra bounce.


We will always bounce here, that what happens when transferring data.  
The question is whether how many cache lines per descriptor.  A pointer 
adds 1 bounce, linear descriptors cost 1/4 bounce, chained descriptors 
cost a bounce.  So best is one ring of linearly chained descriptors.  
Indirect works when you have large requests (like block).



So client writes descriptor page and adds to avail ring, then writes to
index.
Server reads index, avail ring, descriptor page (3).  Writes used
entry (1).  Updates last_used (1).  Client reads used (1), derefs avail (1),
updates last_used (1), cleans descriptor page (1).

That's 9 cacheline transfers, worst case.  Best case of a half-full ring
in steady state, assuming 128-byte cache lines, the avail ring costs are
1/16, the used entry is 1/64.  This drops it to 6 and 9/64 transfers.
   


Cache lines are 64 bytes these days.

With a single ring, client writes descriptors (ceil(N/4)), updates head 
(1).  Server reads head (1) copies out descriptors (ceil(N/4)), issues 
requests, copies back completions ((ceil(N/4)), updates tail (1).  
Client reads back tail and descriptors (1 + ceil(N/4))


Worst case: 4 + 4 * ceil(N/4).  Best case I think this drops by half.




(Note, the current scheme adds 2 more cacheline transfers, for the descriptor
table, worst case.


2 bounces per descriptor due to random access.


   Assuming indirect, we get 2/8 xfer best case.  Either way,
it's not the main source of cacheline xfers).
   


Indirect adds a double bounce to get to the descriptor table, but any 
descriptors there are accessed linearly.  It's only good when you have 
large chains.



Can we do better?  The obvious idea is to try to get rid of last_used and
used, and use the ring itself.  We would use an invalid entry to mark the
head of the ring.
   


Interesting!  So a peer will read until it hits a wall.  But how to 
update the wall atomically?


Maybe we can have a flag in the descriptor indicate headness or 
tailness.  Update looks ugly though: write descriptor with head flag, 
write next descriptor with head flag, remove flag from previous descriptor.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html