Re: [PATCH] slow_map: minor improvements to ROM BAR handling
On Tue, Dec 22, 2009 at 02:34:42PM +0100, Alexander Graf wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Dec 22, 2009 at 01:05:23PM +0100, Alexander Graf wrote:
> > > Michael S. Tsirkin wrote:
> > > > ROM BAR can be handled same as regular BAR: load_option_roms
> > > > utility will take care of copying it to RAM as appropriate.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > > > ---
> > > >
> > > > This patch applies on top of agraf's one, it takes care of
> > > > non-page aligned ROM BARs as well: they mostly are taken care of,
> > > > we just do not need to warn user about them.
> > > >
> > > >  hw/device-assignment.c |   20 ++++++++++----------
> > > >  1 files changed, 9 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> > > > index 000fa61..066fdb6 100644
> > > > --- a/hw/device-assignment.c
> > > > +++ b/hw/device-assignment.c
> > > > @@ -486,25 +486,23 @@ static int assigned_dev_register_regions(PCIRegion *io_regions,
> > > >              : PCI_BASE_ADDRESS_SPACE_MEMORY;
> > > >
> > > >          if (cur_region->size & 0xFFF) {
> > > > -            fprintf(stderr, "PCI region %d at address 0x%llx "
> > > > -                    "has size 0x%x, which is not a multiple of 4K. "
> > > > -                    "You might experience some performance hit due to that.\n",
> > > > -                    i, (unsigned long long)cur_region->base_addr,
> > > > -                    cur_region->size);
> > > > +            if (i != PCI_ROM_SLOT) {
> > > > +                fprintf(stderr, "PCI region %d at address 0x%llx "
> > > > +                        "has size 0x%x, which is not a multiple of 4K. "
> > > > +                        "You might experience some performance hit "
> > > > +                        "due to that.\n",
> > > > +                        i, (unsigned long long)cur_region->base_addr,
> > > > +                        cur_region->size);
> > > > +            }
> > > >              slow_map = 1;
> > >
> > > This is wrong. You're setting slow_map = 1 on code that is very
> > > likely to be executed inside the guest. That doesn't work.
> >
> > It is? Can you really run code directly from a PCI card? I looked at
> > the BIOS boot specification and it always talks about shadowing PCI
> > ROMs.
>
> I'm not sure the BIOS is the only one executing ROMs. If it is, then
> I'm good with the change. Maybe it'd make sense to also add a
> read-only flag so we don't accidentally try to write to the ROM region
> with slow_map.
>
> Alex

Correct: I think it's made readonly down the road with mprotect, so an
attempt to do so will crash qemu :)

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
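The read-only flag Alex asks for is what the later patch versions implement by registering the ROM slot with a NULL write callback. A standalone C sketch of that dispatch idea (illustrative only — `bar_region`, `bar_read`, and friends are made-up names, not qemu's `cpu_register_io_memory()` API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal model of a slow-mapped BAR: accesses go through callbacks,
 * and a NULL write callback makes the region read-only -- guest writes
 * are silently dropped instead of reaching the backing store. */
typedef uint32_t (*bar_read_fn)(void *opaque, uint64_t addr);
typedef void (*bar_write_fn)(void *opaque, uint64_t addr, uint32_t val);

struct bar_region {
    void *opaque;
    bar_read_fn read;
    bar_write_fn write;   /* NULL => read-only, e.g. a ROM BAR */
};

static uint32_t bar_read(struct bar_region *r, uint64_t addr)
{
    return r->read(r->opaque, addr);
}

static void bar_write(struct bar_region *r, uint64_t addr, uint32_t val)
{
    if (r->write)                       /* ROM BAR: ignore the write */
        r->write(r->opaque, addr, val);
}

/* Trivial backing-store callbacks for the demo. */
static uint32_t mem_read(void *opaque, uint64_t addr)
{
    return ((uint32_t *)opaque)[addr / 4];
}

static void mem_write(void *opaque, uint64_t addr, uint32_t val)
{
    ((uint32_t *)opaque)[addr / 4] = val;
}
```

Registering the same backing store with `write = NULL` gives ROM semantics; with `mem_write` it behaves as a normal slow-mapped BAR.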
Re: [PATCH] slow_map: minor improvements to ROM BAR handling
On Tue, Dec 22, 2009 at 05:00:52PM +0100, Alexander Graf wrote:
> Avi Kivity wrote:
> > On 12/22/2009 05:41 PM, Alexander Graf wrote:
> > > > We could certainly extend emulate.c to fetch instruction bytes
> > > > from userspace. It uses ->read_std() now, so we'd need to switch
> > > > to ->read_emulated() and add appropriate buffering.
> > >
> > > I thought the policy on emulate.c was to not have a full
> > > instruction emulator but only emulate instructions that do PT
> > > modifications or MMIO access?
> >
> > It's not a policy, just laziness. With emulate_invalid_guest_state=1
> > we need many more instructions. Of course I don't want to add
> > instructions just for the sake of it, since they will be untested.
> >
> > I'd much prefer not to run from mmio if possible - just pointing out
> > it's doable.
>
> Right... emulator is _really_ small. It only does a few MMU specific
> instructions, a couple of privileged ones and MMIO accessing ones.
>
> > > Btw, we're in the same situation with PowerPC here. The instruction
> >
> > Plus, you have a fixed length instruction length, likely more
> > regular too. I imagine powerpc is load/store, so you don't have to
> > emulate a zillion ALU instructions?
>
> Well, it's certainly doable (and easier than on x86). But I'm in the
> same position as you on the x86 side. Why increase the emulator size
> at least 10 times if we don't have to?
>
> Either way, people will report bugs when / if they actually start
> executing code off MMIO. So let's not care too much about it for now.
> Just make sure the read-only check is in.
>
> Alex

So I think all we need is this on top?

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 066fdb6..0c3c8f4 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -233,7 +233,8 @@ static void assigned_dev_iomem_map_slow(PCIDevice *pci_dev, int region_num,
     int m;
 
     DEBUG("slow map\n");
-    m = cpu_register_io_memory(slow_bar_read, slow_bar_write, region);
+    m = cpu_register_io_memory(slow_bar_read, region_num == PCI_ROM_SLOT ?
+                               NULL : slow_bar_write, region);
     cpu_register_physical_memory(e_phys, e_size, m);
 
     /* MSI-X MMIO page */
[PATCHv2] slow_map: minor improvements to ROM BAR handling
ROM BAR can be handled same as regular BAR: load_option_roms utility
will take care of copying it to RAM as appropriate.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Changes from v1: made ROM BAR read-only.

 hw/device-assignment.c |   23 ++++++++++++-----------
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 000fa61..0c3c8f4 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -233,7 +233,8 @@ static void assigned_dev_iomem_map_slow(PCIDevice *pci_dev, int region_num,
     int m;
 
     DEBUG("slow map\n");
-    m = cpu_register_io_memory(slow_bar_read, slow_bar_write, region);
+    m = cpu_register_io_memory(slow_bar_read, region_num == PCI_ROM_SLOT ?
+                               NULL : slow_bar_write, region);
     cpu_register_physical_memory(e_phys, e_size, m);
 
     /* MSI-X MMIO page */
@@ -486,25 +487,23 @@ static int assigned_dev_register_regions(PCIRegion *io_regions,
             : PCI_BASE_ADDRESS_SPACE_MEMORY;
 
         if (cur_region->size & 0xFFF) {
-            fprintf(stderr, "PCI region %d at address 0x%llx "
-                    "has size 0x%x, which is not a multiple of 4K. "
-                    "You might experience some performance hit due to that.\n",
-                    i, (unsigned long long)cur_region->base_addr,
-                    cur_region->size);
+            if (i != PCI_ROM_SLOT) {
+                fprintf(stderr, "PCI region %d at address 0x%llx "
+                        "has size 0x%x, which is not a multiple of 4K. "
+                        "You might experience some performance hit "
+                        "due to that.\n",
+                        i, (unsigned long long)cur_region->base_addr,
+                        cur_region->size);
+            }
             slow_map = 1;
         }
 
-        if (slow_map && (i == PCI_ROM_SLOT)) {
-            fprintf(stderr, "ROM not aligned - can't continue\n");
-            return -1;
-        }
-
         /* map physical memory */
         pci_dev->v_addrs[i].e_physbase = cur_region->base_addr;
         if (i == PCI_ROM_SLOT) {
             pci_dev->v_addrs[i].u.r_virtbase =
                 mmap(NULL,
-                     (cur_region->size + 0xFFF) & 0xFFFFF000,
+                     cur_region->size,
                      PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE,
                      0, (off_t) 0);

-- 
1.6.6.rc1.43.gf55cc
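The alignment test the patch keeps (`size & 0xFFF`) and the round-up arithmetic it removes from the ROM mmap path can be sketched in plain C (a standalone illustration, not qemu code):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000u
#define PAGE_MASK (PAGE_SIZE - 1)   /* 0xFFF */

/* Non-zero when size is not a multiple of 4K -- the condition guarding
 * the performance warning (and forcing slow_map). */
static int is_unaligned(uint32_t size)
{
    return (size & PAGE_MASK) != 0;
}

/* Round a BAR size up to the next 4K boundary -- the arithmetic the
 * pre-patch code applied to the anonymous mmap length for the ROM. */
static uint32_t round_up_4k(uint32_t size)
{
    return (size + PAGE_MASK) & ~PAGE_MASK;
}
```

The round-up never shrinks a size and is idempotent on already-aligned sizes, which is why dropping it in favour of the raw size is safe once the ROM is copied to RAM by load_option_roms.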
Re: [PATCH] slow_map: minor improvements to ROM BAR handling
On Tue, Dec 22, 2009 at 05:39:22PM +0200, Avi Kivity wrote:
> On 12/22/2009 05:36 PM, Alexander Graf wrote:
> > > Is there a way to trap this and fprintf something?
> >
> > I don't think so. KVM will just trap on execution outside of RAM and
> > either fail badly or throw something bad into the guest. MMIO access
> > works by analyzing the instruction that accesses the MMIO address.
> > That just doesn't work when we don't have an instruction to analyze.
>
> We could certainly extend emulate.c to fetch instruction bytes from
> userspace. It uses ->read_std() now, so we'd need to switch to
> ->read_emulated() and add appropriate buffering.

You mean run with KVM, and TCG will kick in when there's an instruction
we can't support natively?

-- 
error compiling committee.c: too many arguments to function
[PATCHv3] slow_map: minor improvements to ROM BAR handling
ROM BAR can be handled same as regular BAR: load_option_roms utility
will take care of copying it to RAM as appropriate.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Changes from v2: replace ?: with an if statement
Changes from v1: make ROM bar read only

 hw/device-assignment.c |   25 +++++++++++++------------
 1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 000fa61..9197dca 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -233,7 +233,10 @@ static void assigned_dev_iomem_map_slow(PCIDevice *pci_dev, int region_num,
     int m;
 
     DEBUG("slow map\n");
-    m = cpu_register_io_memory(slow_bar_read, slow_bar_write, region);
+    if (region_num == PCI_ROM_SLOT)
+        m = cpu_register_io_memory(slow_bar_read, NULL, region);
+    else
+        m = cpu_register_io_memory(slow_bar_read, slow_bar_write, region);
     cpu_register_physical_memory(e_phys, e_size, m);
 
     /* MSI-X MMIO page */
@@ -486,25 +489,23 @@ static int assigned_dev_register_regions(PCIRegion *io_regions,
             : PCI_BASE_ADDRESS_SPACE_MEMORY;
 
         if (cur_region->size & 0xFFF) {
-            fprintf(stderr, "PCI region %d at address 0x%llx "
-                    "has size 0x%x, which is not a multiple of 4K. "
-                    "You might experience some performance hit due to that.\n",
-                    i, (unsigned long long)cur_region->base_addr,
-                    cur_region->size);
+            if (i != PCI_ROM_SLOT) {
+                fprintf(stderr, "PCI region %d at address 0x%llx "
+                        "has size 0x%x, which is not a multiple of 4K. "
+                        "You might experience some performance hit "
+                        "due to that.\n",
+                        i, (unsigned long long)cur_region->base_addr,
+                        cur_region->size);
+            }
             slow_map = 1;
         }
 
-        if (slow_map && (i == PCI_ROM_SLOT)) {
-            fprintf(stderr, "ROM not aligned - can't continue\n");
-            return -1;
-        }
-
         /* map physical memory */
         pci_dev->v_addrs[i].e_physbase = cur_region->base_addr;
         if (i == PCI_ROM_SLOT) {
             pci_dev->v_addrs[i].u.r_virtbase =
                 mmap(NULL,
-                     (cur_region->size + 0xFFF) & 0xFFFFF000,
+                     cur_region->size,
                      PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE,
                      0, (off_t) 0);

-- 
1.6.6.rc1.43.gf55cc
[PATCH] vhost: access check thinko fixes
This fixes two issues with the recent access checking patch:
1. if (d->vqs[i].private_data) -> if (d->vqs[i].private_data)
2. we can't forbid log base changes while the ring is running, because
the host needs to resize the log in rare cases (e.g. when memory is
added with a balloon), and in that case it needs to increase the log
size with realloc, which might move the log address.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Rusty, this needs to be applied on top of the access_ok patch. If you
want me to roll it in with that one and resend, let me know please.
Thanks!

 drivers/vhost/vhost.c |   39 ++++++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2b65d9b..c8c25db 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -230,7 +230,7 @@ static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 }
 
 /* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(struct vhost_virtqueue *vq, struct vhost_memory *mem,
+static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
 			       int log_all)
 {
 	int i;
@@ -242,7 +242,7 @@ static int vq_memory_access_ok(struct vhost_virtqueue *vq, struct vhost_memory *
 		else if (!access_ok(VERIFY_WRITE, (void __user *)a,
 				    m->memory_size))
 			return 0;
-		else if (log_all && !log_access_ok(vq->log_base,
+		else if (log_all && !log_access_ok(log_base,
 						   m->guest_phys_addr,
 						   m->memory_size))
 			return 0;
@@ -259,9 +259,10 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
 	for (i = 0; i < d->nvqs; ++i) {
 		int ok;
 		mutex_lock(&d->vqs[i].mutex);
-		/* If ring is not running, will check when it's enabled. */
-		if (d->vqs[i].private_data)
-			ok = vq_memory_access_ok(d->vqs[i], mem, log_all);
+		/* If ring is inactive, will check when it's enabled. */
+		if (d->vqs[i].private_data)
+			ok = vq_memory_access_ok(d->vqs[i].log_base, mem,
+						 log_all);
 		else
 			ok = 1;
 		mutex_unlock(&d->vqs[i].mutex);
@@ -290,18 +291,25 @@ int vhost_log_access_ok(struct vhost_dev *dev)
 	return memory_access_ok(dev, dev->memory, 1);
 }
 
-/* Can we start vq? */
+/* Verify access for write logging. */
 /* Caller should have vq mutex and device mutex */
-int vhost_vq_access_ok(struct vhost_virtqueue *vq)
+static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
 {
-	return vq_access_ok(vq->num, vq->desc, vq->avail, vq->used) &&
-		vq_memory_access_ok(vq, vq->dev->memory,
+	return vq_memory_access_ok(log_base, vq->dev->memory,
 			    vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
-		(!vq->log_used || log_access_ok(vq->log_base, vq->log_addr,
+		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
 					sizeof *vq->used +
 					vq->num * sizeof *vq->used->ring));
 }
 
+/* Can we start vq? */
+/* Caller should have vq mutex and device mutex */
+int vhost_vq_access_ok(struct vhost_virtqueue *vq)
+{
+	return vq_access_ok(vq->num, vq->desc, vq->avail, vq->used) &&
+		vq_log_access_ok(vq, vq->log_base);
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
 	struct vhost_memory mem, *newmem, *oldmem;
@@ -564,15 +572,14 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
 		}
 		for (i = 0; i < d->nvqs; ++i) {
 			struct vhost_virtqueue *vq;
+			void __user *base = (void __user *)(unsigned long)p;
 			vq = d->vqs + i;
 			mutex_lock(&vq->mutex);
-			/* Moving log base with an active backend?
-			 * You don't want to do that. */
-			if (vq->private_data && (vq->log_used ||
-			    vhost_has_feature(d, VHOST_F_LOG_ALL)))
-				r = -EBUSY;
+			/* If ring is inactive, will check when it's enabled. */
+			if (vq->private_data && !vq_log_access_ok(vq, base))
+				r = -EFAULT;
 			else
-				vq->log_base = (void __user *)(unsigned long)p;
+				vq->log_base = base;
 			mutex_unlock(&vq->mutex);
 		}
 		break;

-- 
1.6.6.rc1.43.gf55cc
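The commit message's point about the balloon comes down to bitmap arithmetic: the dirty log holds one bit per guest page, so covering more guest memory needs a bigger log, and realloc may move its base. A standalone sketch of that sizing math (illustrative names, assuming 4K logging granularity; not the kernel's actual `log_access_ok()` code):

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_PAGE_SHIFT 12   /* assumed 4K logging granularity */

/* Bytes of dirty-log bitmap needed to cover guest-physical range
 * [0, addr + sz): one bit per page, rounded up to whole bytes. */
static uint64_t log_bytes_needed(uint64_t addr, uint64_t sz)
{
    uint64_t last_page = (addr + sz - 1) >> VHOST_PAGE_SHIFT;
    return last_page / 8 + 1;
}
```

When the balloon returns memory and the covered range grows past a byte boundary, `log_bytes_needed()` increases, so the host must realloc the log -- which is exactly why forbidding log base changes while the ring runs would be wrong.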
Re: [PATCH] vhost-net: defer f-private_data until setup succeeds
On Tue, Dec 22, 2009 at 11:02:28AM -0800, Chris Wright wrote:
> Trivial change, just for readability. The filp is not installed on
> failure, so the current code is not incorrect (also vhost_dev_init
> currently has no failure case). This just treats setting
> f->private_data as something with global scope (sure, true only after
> fd_install).
>
> Signed-off-by: Chris Wright <chr...@redhat.com>

Acked-by: Michael S. Tsirkin <m...@redhat.com>

> ---
>  drivers/vhost/net.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 22d5fef..0697ab2 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -326,7 +326,6 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	int r;
>  	if (!n)
>  		return -ENOMEM;
> -	f->private_data = n;
>  	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
>  	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
>  	r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
> @@ -338,6 +337,9 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
>  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> +
> +	f->private_data = n;
> +
>  	return 0;
>  }
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On Wed, Dec 23, 2009 at 11:28:08AM -0800, Ira W. Snyder wrote:
> On Wed, Dec 23, 2009 at 12:34:44PM -0500, Gregory Haskins wrote:
> > On 12/23/09 1:15 AM, Kyle Moffett wrote:
> > > On Tue, Dec 22, 2009 at 12:36, Gregory Haskins
> > > <gregory.hask...@gmail.com> wrote:
> > > > On 12/22/09 2:57 AM, Ingo Molnar wrote:
> > > > > * Gregory Haskins <gregory.hask...@gmail.com> wrote:
> > > > > > Actually, these patches have nothing to do with the KVM
> > > > > > folks. [...]
> > > > >
> > > > > That claim is curious to me - the AlacrityVM host
> > > >
> > > > It's quite simple, really. These drivers support accessing vbus,
> > > > and vbus is hypervisor agnostic. In fact, vbus isn't necessarily
> > > > even hypervisor related. It may be used anywhere where a Linux
> > > > kernel is the io backend, which includes hypervisors like
> > > > AlacrityVM, but also userspace apps, and interconnected physical
> > > > systems as well.

So a focus on interconnecting physical systems would, I think, be one
way for vbus to stop conflicting with KVM. If drivers for such systems
appear, I expect that the relevant (hypervisor-agnostic) vbus bits
would be very uncontroversial. This would not be the first technology
to make the jump from attempting to be a PCI replacement to being an
interconnect, btw - I think infiniband did this as well.

-- 
MST
Re: [PATCH] vhost: rcu annotation fixup
On Tue, Jan 18, 2011 at 09:48:34AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> > When built with rcu checks enabled, vhost triggers bogus warnings as
> > vhost features are read without dev->mutex sometimes. Fixing it
> > properly is not trivial as vhost.h does not know which lockdep
> > classes it will be used under. Disable the warning by stubbing out
> > the check for now.
> >
> > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > ---
> >  drivers/vhost/vhost.h |    4 +---
> >  1 files changed, 1 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 2af44b7..2d03a31 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> >  {
> >  	unsigned acked_features;
> >
> > -	acked_features =
> > -		rcu_dereference_index_check(dev->acked_features,
> > -					    lockdep_is_held(&dev->mutex));
> > +	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
>
> Ouch!!! Could you please at least add a comment?

Yes, OK.

> Alternatively, pass in the lock that is held and check for that?
> Given that this is a static inline, the compiler should be able to
> optimize the argument away when !PROVE_RCU, correct?
>
> 							Thanx, Paul

Hopefully, yes. We don't always have a lock: the idea was to create a
lockdep for these cases. But we can't pass the pointer to that ...

> >  	return acked_features & (1 << bit);
> >  }
> >
> > -- 
> > 1.7.3.2.91.g446ac
[PATCHv2] vhost: rcu annotation fixup
When built with rcu checks enabled, vhost triggers bogus warnings as
vhost features are read without dev->mutex sometimes, and the private
pointer is read with our kind of rcu where work serves as a read side
critical section. Fixing it properly is not trivial. Disable the
warnings by stubbing out the checks for now.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Changes from v1: add TODO, fix more warnings.

 drivers/vhost/net.c   |    9 +++++----
 drivers/vhost/vhost.h |    6 +++---
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..f616cef 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -128,8 +128,7 @@ static void handle_tx(struct vhost_net *net)
 	size_t hdr_size;
 	struct socket *sock;
 
-	/* TODO: check that we are running from vhost_worker?
-	 * Not sure it's worth it, it's straight-forward enough. */
+	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
 	if (!sock)
 		return;
@@ -306,7 +305,8 @@ static void handle_rx_big(struct vhost_net *net)
 	size_t len, total_len = 0;
 	int err;
 	size_t hdr_size;
-	struct socket *sock = rcu_dereference(vq->private_data);
+	/* TODO: check that we are running from vhost_worker? */
+	struct socket *sock = rcu_dereference_check(vq->private_data, 1);
 
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 		return;
@@ -415,7 +415,8 @@ static void handle_rx_mergeable(struct vhost_net *net)
 	int err, headcount;
 	size_t vhost_hlen, sock_hlen;
 	size_t vhost_len, sock_len;
-	struct socket *sock = rcu_dereference(vq->private_data);
+	/* TODO: check that we are running from vhost_worker? */
+	struct socket *sock = rcu_dereference_check(vq->private_data, 1);
 
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 		return;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 2af44b7..b3363ae 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -173,9 +173,9 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
 {
 	unsigned acked_features;
 
-	acked_features =
-		rcu_dereference_index_check(dev->acked_features,
-					    lockdep_is_held(&dev->mutex));
+	/* TODO: check that we are running from vhost_worker or dev mutex is
+	 * held? */
+	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
 	return acked_features & (1 << bit);
 }

-- 
1.7.3.2.91.g446ac
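The effect of the patch — passing a constant-true condition to the `_check` variants — can be modeled in plain userspace C. This is only a model of the checking side (names like `model_rcu_deref_check` are invented; the real kernel macro also carries RCU semantics and calls `lockdep_rcu_dereference()` to emit the splat):

```c
#include <assert.h>

/* Count of would-be lockdep splats, for demonstration. */
static int warnings;

/* Model of rcu_dereference_check()'s debug side: 'safe' is a condition
 * under which the access is known to be protected (e.g. "mutex held").
 * When the condition is false, a warning fires. Passing a constant 1
 * -- as the vhost patch does -- stubs the check out entirely. */
static int model_rcu_deref_check(int value, int safe)
{
    if (!safe)
        warnings++;   /* models the lockdep_rcu_dereference() splat */
    return value;
}
```

This makes Mel's objection concrete: with the condition stubbed to 1 the macro still documents "dereference happens here", but no longer documents *which* lock protects it.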
Re: [PATCH] vhost: rcu annotation fixup
On Tue, Jan 18, 2011 at 11:02:33AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 18, 2011 at 07:55:00PM +0200, Michael S. Tsirkin wrote:
> > On Tue, Jan 18, 2011 at 09:48:34AM -0800, Paul E. McKenney wrote:
> > > On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> > > > When built with rcu checks enabled, vhost triggers bogus
> > > > warnings as vhost features are read without dev->mutex
> > > > sometimes. Fixing it properly is not trivial as vhost.h does not
> > > > know which lockdep classes it will be used under. Disable the
> > > > warning by stubbing out the check for now.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > > > ---
> > > >  drivers/vhost/vhost.h |    4 +---
> > > >  1 files changed, 1 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > > index 2af44b7..2d03a31 100644
> > > > --- a/drivers/vhost/vhost.h
> > > > +++ b/drivers/vhost/vhost.h
> > > > @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> > > >  {
> > > >  	unsigned acked_features;
> > > >
> > > > -	acked_features =
> > > > -		rcu_dereference_index_check(dev->acked_features,
> > > > -					    lockdep_is_held(&dev->mutex));
> > > > +	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
> > >
> > > Ouch!!! Could you please at least add a comment?
> >
> > Yes, OK.
> >
> > > Alternatively, pass in the lock that is held and check for that?
> > > Given that this is a static inline, the compiler should be able
> > > to optimize the argument away when !PROVE_RCU, correct?
> > >
> > > 							Thanx, Paul
> >
> > Hopefully, yes. We don't always have a lock: the idea was to create
> > a lockdep for these cases. But we can't pass the pointer to that ...
>
> I suppose you could pass a pointer to the lockdep map structure. Not
> sure if this makes sense, but it would handle the situation.

Will it compile with lockdep disabled too? What will the pointer be?

> Alternatively, create a helper function that checks the possibilities
> and screams if none of them are in effect.
>
> 							Thanx, Paul

The problem here is that the callee needs to know about all callers.

> > > >  	return acked_features & (1 << bit);
> > > >  }
> > > >
> > > > -- 
> > > > 1.7.3.2.91.g446ac
Re: Flow Control and Port Mirroring Revisited
On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > > So it won't be all that simple to implement well, and before we
> > > try, I'd like to know whether there are applications that are
> > > helped by it. For example, we could try to measure latency at
> > > various pps and see whether the backpressure helps. netperf has
> > > -b, -w flags which might help these measurements.
> >
> > Those options are enabled when one adds --enable-burst to the
> > pre-compilation ./configure of netperf (one doesn't have to
> > recompile netserver). However, if one is also looking at latency
> > statistics via the -j option in the top-of-trunk, or simply at the
> > histogram with --enable-histogram on the ./configure and a verbosity
> > level of 2 (global -v 2), then one wants the very top of trunk
> > netperf from:
> >
> > http://www.netperf.org/svn/netperf2/trunk
> >
> > to get the recently added support for accurate (netperf-level) RTT
> > measurements on burst-mode request/response tests.
> >
> > happy benchmarking,
> >
> > rick jones
> >
> > PS - the enhanced latency statistics from -j are only available in
> > the omni version of the TCP_RR test. To get that add a --enable-omni
> > to the ./configure - and in this case both netperf and netserver
> > have to be recompiled.
>
> Is this TCP only? I would love to get latency data from UDP as well.

For very basic output one can peruse the output of:

src/netperf -t omni -- -O /?

and then pick those outputs of interest and put them into an output
selection file, which one then passes to either the (test-specific) -o,
-O or -k option to get CSV, human-readable or keyval output
respectively. E.g.:

raj@tardy:~/netperf2_trunk$ cat foo
THROUGHPUT,THROUGHPUT_UNITS
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY

When foo is passed to -o one will get those all on one line of CSV.
To -O one gets three lines of more netperf-classic-like human-readable
output, and when one passes that to -k one gets a string of keyval
output a la:

raj@tardy:~/netperf2_trunk$ src/netperf -t omni -j -v 2 -- -r 1 -d rr -k foo
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram
THROUGHPUT=29454.12
THROUGHPUT_UNITS=Trans/s
RT_LATENCY=33.951
MIN_LATENCY=19
MEAN_LATENCY=32.00
MAX_LATENCY=126
P50_LATENCY=32
P90_LATENCY=38
P99_LATENCY=41
STDDEV_LATENCY=5.46

Histogram of request/response times
UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_USEC      :    0: 3553: 45244: 237790: 7859:   86:   10:    3:    0:    0
HUNDRED_USEC  :    0:    2:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_MSEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_MSEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
>100_SECS: 0
HIST_TOTAL:      294547
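The keyval (-k) format shown above is the easiest to consume from a benchmark harness. A small helper for pulling one numeric field out of such output (a scripting convenience written for this discussion, not part of netperf itself):

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Find KEY=value at the start of a line in netperf -k (keyval) output,
 * e.g. the THROUGHPUT=29454.12 line above, and return the value as a
 * double. Returns 'missing' if the key is absent. */
static double keyval_get(const char *output, const char *key, double missing)
{
    const char *p = output;
    size_t klen = strlen(key);

    while (p) {
        /* Require an exact key followed by '=', so that e.g. the key
         * "THROUGHPUT" does not match a "THROUGHPUT_UNITS=..." line. */
        if (strncmp(p, key, klen) == 0 && p[klen] == '=')
            return strtod(p + klen + 1, NULL);
        p = strchr(p, '\n');    /* advance to the next line, if any */
        if (p)
            p++;
    }
    return missing;
}
```

Feeding it the captured stdout of `src/netperf ... -k foo` gives direct access to THROUGHPUT, P99_LATENCY, and the rest without awk gymnastics.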
Re: [PATCH] vhost: rcu annotation fixup
On Wed, Jan 19, 2011 at 12:40:40AM +0000, Mel Gorman wrote:
> On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> > When built with rcu checks enabled, vhost triggers bogus warnings as
> > vhost features are read without dev->mutex sometimes. Fixing it
> > properly is not trivial as vhost.h does not know which lockdep
> > classes it will be used under. Disable the warning by stubbing out
> > the check for now.
>
> What is the harm in leaving the bogus warnings until the difficult
> fix happens? RCU checks enabled does not seem like something that is
> enabled in production.

I would like to run with rcu checks enabled sometimes to debug kvm,
which has an elaborate rcu strategy. Bogus warnings in the log make it
easy to overlook the real ones.

Further, the rcu macros used are a form of documentation. If we have

-	acked_features =
-		rcu_dereference_index_check(dev->acked_features,
-					    lockdep_is_held(&dev->mutex));

this means 'taken in rcu read side critical section or under mutex',
while

+	acked_features = rcu_dereference_index_check(dev->acked_features, 1);

means 'not checked'.

> If this patch is applied, there is always the risk that it'll be
> simply forgotten about.

Well, that's why I put in a TODO. If there's demand for it, I can add a
Kconfig option to trigger a warning at each unchecked rcu call in
vhost-net, but I doubt it'll get a lot of use :)

> -- 
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
[PATCH] vhost: force vhost off for non-MSI guests
When MSI is off, each interrupt needs to be bounced through the io
thread when it's set/cleared, so vhost-net causes more context switches
and higher CPU utilization than userspace virtio which handles
networking in the same thread.

We'll need to fix this by adding level irq support in kvm irqfd; for
now, disable vhost-net in these configurations.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

I need to report some error from virtio-pci that would be handled
specially (disable but don't report an error), so I wanted one that's
never likely to be used by a userspace ioctl. I selected ERANGE but
it'd be easy to switch to something else. Comments?

 hw/vhost.c      |    4 +++-
 hw/virtio-net.c |    6 ++++--
 hw/virtio-pci.c |    3 +++
 hw/virtio.h     |    2 ++
 4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1d09ed0..c79765a 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
     r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ccb3e63..5de3fee 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
         if (!n->vhost_started) {
             int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
             if (r < 0) {
-                error_report("unable to start vhost net: %d: "
-                             "falling back on userspace virtio", -r);
+                if (r != -EVIRTIO_DISABLED) {
+                    error_report("unable to start vhost net: %d: "
+                                 "falling back on userspace virtio", -r);
+                }
             } else {
                 n->vhost_started = 1;
             }

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index dd8887a..dbf4be0 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
     EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
 
     if (assign) {
+        if (!msix_enabled(&proxy->pci_dev)) {
+            return -EVIRTIO_DISABLED;
+        }
         int r = event_notifier_init(notifier, 0);
         if (r < 0) {
             return r;

diff --git a/hw/virtio.h b/hw/virtio.h
index d8546d5..53bbdba 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -98,6 +98,8 @@ typedef struct {
     void (*vmstate_change)(void * opaque, bool running);
 } VirtIOBindings;
 
+#define EVIRTIO_DISABLED ERANGE
+
 #define VIRTIO_PCI_QUEUE_MAX 64
 
 #define VIRTIO_NO_VECTOR 0xffff

-- 
1.7.3.2.91.g446ac
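The error convention the patch relies on — callees return negative errno values, and one sentinel value means "disable silently" while anything else is reportable — can be demonstrated standalone (the `set_guest_notifier`/`should_report` helpers below are illustrative stand-ins, not the qemu functions):

```c
#include <errno.h>

/* Alias one unlikely errno as the sentinel, as the patch does. */
#define EVIRTIO_DISABLED ERANGE

/* Stand-in for the notifier setup path: without MSI-X, return the
 * sentinel so callers fall back to userspace virtio quietly. */
static int set_guest_notifier(int msix_enabled)
{
    if (!msix_enabled)
        return -EVIRTIO_DISABLED;   /* disable, but don't complain */
    return 0;                       /* success */
}

/* Callers print an error only for "real" failures. */
static int should_report(int r)
{
    return r < 0 && r != -EVIRTIO_DISABLED;
}
```

The risk MST acknowledges is that ERANGE could in principle leak out of some ioctl on this path, which is why he picked a value "never likely to be used by a userspace ioctl" and invited alternatives.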
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkinm...@redhat.com I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says use vhost where it helps but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. Maybe this is best handled by a documentation update? We always said: use vhost=on to enable experimental in kernel accelerator\n note 'enable' not 'require'. This is similar to how we specify nvectors : you can not make guest use the feature. 
How about this:

diff --git a/qemu-options.hx b/qemu-options.hx
index 898561d..3c937c1 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1061,6 +1061,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
     "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
     "use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
     "use vhost=on to enable experimental in kernel accelerator\n"
+    "    (note: vhost=on has no effect unless guest uses MSI-X)\n"
     "use 'vhostfd=h' to connect to an already opened vhost net device\n"
 #endif
     "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote: On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments? Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED? -Sridhar The error is reported by virtio-pci which does not know about vhost. I started with EVIRTIO_MSIX_DISABLED and made it shorter. Would EVIRTIO_MSIX_DISABLED be better?
hw/vhost.c      |4 +++-
 hw/virtio-net.c |6 ++++--
 hw/virtio-pci.c |3 +++
 hw/virtio.h     |2 ++
 4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1d09ed0..c79765a 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
     r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ccb3e63..5de3fee 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
     if (!n->vhost_started) {
         int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
         if (r < 0) {
-            error_report("unable to start vhost net: %d: "
-                         "falling back on userspace virtio", -r);
+            if (r != -EVIRTIO_DISABLED) {
+                error_report("unable to start vhost net: %d: "
+                             "falling back on userspace virtio", -r);
+            }
         } else {
             n->vhost_started = 1;
         }
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index dd8887a..dbf4be0 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
     EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
     if (assign) {
+        if (!msix_enabled(&proxy->pci_dev)) {
+            return -EVIRTIO_DISABLED;
+        }
         int r = event_notifier_init(notifier, 0);
         if (r < 0) {
             return r;
diff --git a/hw/virtio.h b/hw/virtio.h
index d8546d5..53bbdba 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -98,6 +98,8 @@ typedef struct {
     void (*vmstate_change)(void * opaque, bool running);
 } VirtIOBindings;

+#define EVIRTIO_DISABLED ERANGE
+
 #define VIRTIO_PCI_QUEUE_MAX 64

 #define VIRTIO_NO_VECTOR 0xffff
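The pattern in the patch above, a reserved errno value that means "disabled, fall back silently" rather than "failed", can be sketched in isolation. This is a hedged sketch, not QEMU code: set_guest_notifier() and start_vhost() below are illustrative stand-ins for the real virtio-pci and virtio-net callers.

```c
#include <errno.h>
#include <stdio.h>

/* EVIRTIO_DISABLED aliases an errno value (ERANGE) that this call chain
 * is known never to produce for a real failure, so callers can tell
 * "feature unavailable" apart from "something broke". */
#define EVIRTIO_DISABLED ERANGE

/* Stand-in for virtio_pci_set_guest_notifier(): refuse when MSI-X is off. */
int set_guest_notifier(int msix_enabled)
{
    if (!msix_enabled)
        return -EVIRTIO_DISABLED;
    return 0;
}

/* Stand-in for the vhost_net_start() caller: returns 1 if vhost started,
 * 0 if it silently fell back; only real errors get logged. */
int start_vhost(int msix_enabled)
{
    int r = set_guest_notifier(msix_enabled);
    if (r < 0) {
        if (r != -EVIRTIO_DISABLED)
            fprintf(stderr, "unable to start vhost net: %d\n", -r);
        return 0; /* fall back on userspace virtio */
    }
    return 1;
}
```

The caller checks for the sentinel before printing, which is exactly why Sridhar's naming question matters: the define lives in virtio.h, not vhost.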
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 06:23:36PM -0600, Anthony Liguori wrote: On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkinm...@redhat.com I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says use vhost where it helps but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. In the very least, there needs to be a vhost=force. Having some sort of friendly default policy is fine but we need to provide a mechanism for a user to have the final say. If you want to redefine vhost=on to really mean, use the friendly default, that's fine by me, but only if the vhost=force option exists. OK, I will add that, probably as a separate flag as vhost is a boolean. This will get worse performance but it will be what the user asked for. I actually would think libvirt would want to use vhost=force. 
Debugging with vhost=on is going to be a royal pain in the ass if a user reports bad performance. Given the libvirt XML, you can't actually tell from the guest and the XML whether or not vhost was actually in use or not. Yes you can: check MSI enabled in the guest, if it is - check vhost enabled in the XML. Not that bad at all, is it? Regards, Anthony Liguori We get worse performance without MSI anyway, how is this different? Maybe this is best handled by a documentation update? We always said: use vhost=on to enable experimental in kernel accelerator\n note 'enable' not 'require'. This is similar to how we specify nvectors : you can not make guest use the feature. How about this:

diff --git a/qemu-options.hx b/qemu-options.hx
index 898561d..3c937c1 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1061,6 +1061,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
     "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
     "use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
     "use vhost=on to enable experimental in kernel accelerator\n"
+    "    (note: vhost=on has no effect unless guest uses MSI-X)\n"
     "use 'vhostfd=h' to connect to an already opened vhost net device\n"
 #endif
     "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 06:35:46PM -0700, Alex Williamson wrote: On Thu, 2011-01-20 at 18:23 -0600, Anthony Liguori wrote: On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkinm...@redhat.com I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says use vhost where it helps but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. In the very least, there needs to be a vhost=force. Having some sort of friendly default policy is fine but we need to provide a mechanism for a user to have the final say. If you want to redefine vhost=on to really mean, use the friendly default, that's fine by me, but only if the vhost=force option exists. I actually would think libvirt would want to use vhost=force. Debugging with vhost=on is going to be a royal pain in the ass if a user reports bad performance. 
Given the libvirt XML, you can't actually tell from the guest and the XML whether or not vhost was actually in use or not. If we add a force option, let's please distinguish hotplug from VM creation time. The latter can abort. Hotplug should print an error and fail the initfn. It can't abort at init - MSI is disabled at init, it needs to be enabled by the guest later. And aborting the guest in the middle of the run is a very bad idea. What vhostforce=true will do is force vhost backend to be used even if it is slower. Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
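The policy being negotiated above (plain vhost=on falls back to userspace virtio when the guest has not enabled MSI-X, while vhostforce keeps vhost even if it is slower) reduces to one decision function. A sketch only: use_vhost() and its flag names are hypothetical, not QEMU's API.

```c
/* Decide whether the vhost backend should be used, per the thread's
 * proposed semantics:
 *   vhost=on     -> use vhost only when the guest enabled MSI-X
 *   vhostforce   -> use vhost unconditionally (user takes the perf hit)
 * All three arguments are booleans (0/1). */
int use_vhost(int vhost_on, int vhost_force, int guest_msix_on)
{
    if (!vhost_on && !vhost_force)
        return 0;           /* user did not ask for vhost at all */
    if (guest_msix_on)
        return 1;           /* irqfd fast path available: always use it */
    return vhost_force;     /* without MSI-X, only force keeps vhost on */
}
```

Note the check has to run when the guest toggles MSI-X, not at device init, which is why aborting at init (as discussed below) cannot work.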
Re: Flow Control and Port Mirroring Revisited
On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote: [ Trimmed Eric from CC list as vger was complaining that it is too long ] On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote: So it won't be all that simple to implement well, and before we try, I'd like to know whether there are applications that are helped by it. For example, we could try to measure latency at various pps and see whether the backpressure helps. netperf has -b, -w flags which might help these measurements. Those options are enabled when one adds --enable-burst to the pre-compilation ./configure of netperf (one doesn't have to recompile netserver). However, if one is also looking at latency statistics via the -j option in the top-of-trunk, or simply at the histogram with --enable-histogram on the ./configure and a verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf from: Hi, I have constructed a test where I run an un-paced UDP_STREAM test in one guest and a paced omni rr test in another guest at the same time. Hmm, what is this supposed to measure? Basically each time you run an un-paced UDP_STREAM you get some random load on the network. You can't tell what it was exactly, only that it was between the send and receive throughput. Briefly I get the following results from the omni test:

1. Omni test only:       MEAN_LATENCY=272.00
2. Omni and stream test: MEAN_LATENCY=3423.00
3. cpu and net_cls group: MEAN_LATENCY=493.00
   As per 2 plus cgroups are created for each guest and guest tasks added to the groups
4. 100Mbit/s class:      MEAN_LATENCY=273.00
   As per 3 plus the net_cls groups each have a 100Mbit/s HTB class
5. cpu.shares=128:       MEAN_LATENCY=652.00
   As per 4 plus the cpu groups have cpu.shares set to 128
6. Busy CPUs:            MEAN_LATENCY=15126.00
   As per 5 but the CPUs are made busy using a simple shell while loop

There is a bit of noise in the results as the two netperf invocations aren't started at exactly the same moment. For reference, my netperf invocations are:

netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr -k foo -b 1 -w 200 -m 200

foo contains:
PROTOCOL
THROUGHPUT,THROUGHPUT_UNITS
LOCAL_SEND_THROUGHPUT
LOCAL_RECV_THROUGHPUT
REMOTE_SEND_THROUGHPUT
REMOTE_RECV_THROUGHPUT
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
LOCAL_CPU_UTIL,REMOTE_CPU_UTIL
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Fri, Jan 21, 2011 at 06:19:13AM -0700, Alex Williamson wrote: On Fri, 2011-01-21 at 11:55 +0200, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 06:35:46PM -0700, Alex Williamson wrote: On Thu, 2011-01-20 at 18:23 -0600, Anthony Liguori wrote: On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkinm...@redhat.com I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says use vhost where it helps but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. In the very least, there needs to be a vhost=force. Having some sort of friendly default policy is fine but we need to provide a mechanism for a user to have the final say. If you want to redefine vhost=on to really mean, use the friendly default, that's fine by me, but only if the vhost=force option exists. 
I actually would think libvirt would want to use vhost=force. Debugging with vhost=on is going to be a royal pain in the ass if a user reports bad performance. Given the libvirt XML, you can't actually tell from the guest and the XML whether or not vhost was actually in use or not. If we add a force option, let's please distinguish hotplug from VM creation time. The latter can abort. Hotplug should print an error and fail the initfn. It can't abort at init - MSI is disabled at init, it needs to be enabled by the guest later. And aborting the guest in the middle of the run is a very bad idea. Yeah, I was thinking about the ordering of device being added vs guest enabling MSI this morning. Waiting until the guest decides to try to start using the device to NAK it with an abort is very undesirable. What if when we have vhost=on,force, the device doesn't advertise an INTx (PCI_INTERRUPT_PIN = 0)? Alex Then we break backward compatibility with old guests. I don't see what the issue is really: It is trivial to check that the guest uses MSIX. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Flow Control and Port Mirroring Revisited
On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote: On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote: [ Trimmed Eric from CC list as vger was complaining that it is too long ] On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote: So it won't be all that simple to implement well, and before we try, I'd like to know whether there are applications that are helped by it. For example, we could try to measure latency at various pps and see whether the backpressure helps. netperf has -b, -w flags which might help these measurements. Those options are enabled when one adds --enable-burst to the pre-compilation ./configure of netperf (one doesn't have to recompile netserver). However, if one is also looking at latency statistics via the -j option in the top-of-trunk, or simply at the histogram with --enable-histogram on the ./configure and a verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf from: Hi, I have constructed a test where I run an un-paced UDP_STREAM test in one guest and a paced omni rr test in another guest at the same time. Hmm, what is this supposed to measure? Basically each time you run an un-paced UDP_STREAM you get some random load on the network. You can't tell what it was exactly, only that it was between the send and receive throughput. Rick mentioned in another email that I messed up my test parameters a bit, so I will re-run the tests, incorporating his suggestions. What I was attempting to measure was the effect of an unpaced UDP_STREAM on the latency of more moderated traffic. Because I am interested in what effect an abusive guest has on other guests and how that may be mitigated. Could you suggest some tests that you feel are more appropriate? Yes.
To rephrase my concern in these terms, besides the malicious guest you have another software in host (netperf) that interferes with the traffic, and it cooperates with the malicious guest. Right? IMO for a malicious guest you would send UDP packets that then get dropped by the host. For example block netperf in host so that it does not consume packets from the socket.
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Fri, Jan 21, 2011 at 04:48:02PM -0700, Alex Williamson wrote: When doing device assignment, we use cpu_register_physical_memory() to directly map the qemu mmap of the device resource into the address space of the guest. The unadvertised feature of the register physical memory code path on kvm, at least for this type of mapping, is that it needs to allocate an index from a small, fixed array of memory slots. Even better, if it can't get an index, the code aborts deep in the kvm specific bits, preventing the caller from having a chance to recover. It's really easy to hit this by hot adding too many assigned devices to a guest (pretty easy to hit with too many devices at instantiation time too, but the abort is slightly more bearable there). I'm assuming it's pretty difficult to make the memory slot array dynamically sized. If that's not the case, please let me know as that would be a much better solution. I'm not terribly happy with the solution in this series, it doesn't provide any guarantees whether a cpu_register_physical_memory() will succeed, only slightly better educated guesses. Are there better ideas how we could solve this? Thanks, Alex Put the table in qemu memory, make kvm access it with copy from/to user? It can then be any size ... --- Alex Williamson (2): device-assignment: Count required kvm memory slots kvm: Allow querying free slots hw/device-assignment.c | 59 +++- hw/device-assignment.h |3 ++ kvm-all.c | 16 + kvm.h |2 ++ 4 files changed, 79 insertions(+), 1 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Flow Control and Port Mirroring Revisited
On Sun, Jan 23, 2011 at 05:38:49PM +1100, Simon Horman wrote: On Sat, Jan 22, 2011 at 11:57:42PM +0200, Michael S. Tsirkin wrote: On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote: On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote: [ Trimmed Eric from CC list as vger was complaining that it is too long ] On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote: So it won't be all that simple to implement well, and before we try, I'd like to know whether there are applications that are helped by it. For example, we could try to measure latency at various pps and see whether the backpressure helps. netperf has -b, -w flags which might help these measurements. Those options are enabled when one adds --enable-burst to the pre-compilation ./configure of netperf (one doesn't have to recompile netserver). However, if one is also looking at latency statistics via the -j option in the top-of-trunk, or simply at the histogram with --enable-histogram on the ./configure and a verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf from: Hi, I have constructed a test where I run an un-paced UDP_STREAM test in one guest and a paced omni rr test in another guest at the same time. Hmm, what is this supposed to measure? Basically each time you run an un-paced UDP_STREAM you get some random load on the network. You can't tell what it was exactly, only that it was between the send and receive throughput. Rick mentioned in another email that I messed up my test parameters a bit, so I will re-run the tests, incorporating his suggestions. What I was attempting to measure was the effect of an unpaced UDP_STREAM on the latency of more moderated traffic. Because I am interested in what effect an abusive guest has on other guests and how that may be mitigated. Could you suggest some tests that you feel are more appropriate? Yes.
To rephrase my concern in these terms, besides the malicious guest you have another software in host (netperf) that interferes with the traffic, and it cooperates with the malicious guest. Right? Yes, that is the scenario in this test. Yes but I think that you want to put some controlled load on host. Let's assume that we improve the speed somehow and now you can push more bytes per second without loss. Result might be a regression in your test because you let the guest push as much as it can and suddenly it can push more data through. OTOH with packet loss the load on host is anywhere in between send and receive throughput: there's no easy way to measure it from netperf: the earlier some buffers overrun, the earlier the packets get dropped and the less the load on host. This is why I say that to get a specific load on host you want to limit the sender to a specific BW and then either - make sure packet loss % is close to 0. - make sure packet loss % is close to 100%. IMO for a malicious guest you would send UDP packets that then get dropped by the host. For example block netperf in host so that it does not consume packets from the socket. I'm more interested in rate-limiting netperf than blocking it. Well I mean netperf on host. But in any case, do you mean use iptables or tc based on classification made by net_cls? Just to block netperf you can send it SIGSTOP :) -- MST
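The "limit the sender to a specific BW" idea above is a classic token bucket. A minimal sketch, illustrative only: a real setup would use tc/HTB classes as in the cgroup tests earlier in the thread, and the names here are not from any existing tool.

```c
/* Token bucket: tokens refill at `rate` bytes/sec up to `burst` bytes;
 * a packet is admitted only if enough tokens are available.
 * Timestamps are in microseconds to keep the arithmetic simple. */
typedef struct {
    double rate;        /* refill rate, bytes per second */
    double burst;       /* bucket depth, bytes */
    double tokens;      /* current fill, bytes */
    long long last;     /* time of last update, usec */
} bucket_t;

/* Returns 1 if `bytes` may be sent at time `now_usec`, 0 to drop/delay. */
int bucket_allow(bucket_t *b, long long now_usec, int bytes)
{
    b->tokens += (now_usec - b->last) * b->rate / 1e6;
    if (b->tokens > b->burst)
        b->tokens = b->burst;
    b->last = now_usec;
    if (b->tokens < bytes)
        return 0;
    b->tokens -= bytes;
    return 1;
}
```

With the bucket kept near 0% or 100% loss, as MST suggests, the admitted rate (not the guest's attempted rate) defines the host load.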
Re: Flow Control and Port Mirroring Revisited
On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote: Just to block netperf you can send it SIGSTOP :) Clever :) One could I suppose achieve the same result by making the remote receive socket buffer size smaller than the UDP message size and then not worry about having to learn the netserver's PID to send it the SIGSTOP. I *think* the semantics will be substantially the same? If you could set it, yes. But at least linux ignores any value substantially smaller than 1K, and then multiplies that by 2:

case SO_RCVBUF:
    /* Don't error on this BSD doesn't and if you think
       about it this is right. Otherwise apps have to
       play 'guess the biggest size' games. RCVBUF/SNDBUF
       are treated in BSD as hints */
    if (val > sysctl_rmem_max)
        val = sysctl_rmem_max;
set_rcvbuf:
    sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
    /*
     * We double it on the way in to account for
     * struct sk_buff etc. overhead. Applications
     * assume that the SO_RCVBUF setting they make will
     * allow that much actual data to be received on that
     * socket.
     *
     * Applications are unaware that struct sk_buff and
     * other overheads allocate from the receive buffer
     * during socket buffer allocation.
     *
     * And after considering the possible alternatives,
     * returning the value we actually used in getsockopt
     * is the most desirable behavior.
     */
    if ((val * 2) < SOCK_MIN_RCVBUF)
        sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
    else
        sk->sk_rcvbuf = val * 2;

and

/*
 * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
 * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
 */
#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))

Both will be drops at the socket buffer, albeit for different reasons. The too small socket buffer version though doesn't require one remember to wake the netserver in time to have it send results back to netperf without netperf tossing-up an error and not reporting any statistics.
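The doubling-and-clamping behavior in the sock.c excerpt above is easy to observe from userspace. A Linux-specific sketch (effective_rcvbuf() is an illustrative helper, not a library call):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Ask the kernel for a given SO_RCVBUF and report what it actually
 * installed: on Linux the requested value is doubled on the way in and
 * clamped below by SOCK_MIN_RCVBUF, exactly as in the code quoted above. */
int effective_rcvbuf(int requested)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    int val = requested;
    socklen_t len = sizeof(val);

    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val));
    getsockopt(s, SOL_SOCKET, SO_RCVBUF, &val, &len);
    close(s);
    return val;
}
```

This is why Rick's `-S 1` run below still reports a 256-byte (pre-doubling view) receive buffer instead of 1: the kernel refused to go that low.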
Also, netperf has a no control connection mode where you can, in effect cause it to send UDP datagrams out into the void - I put it there to allow folks to test against the likes of echo discard and chargen services but it may have a use here. Requires that one specify the destination IP and port for the data connection explicitly via the test-specific options. In that mode the only stats reported are those local to netperf rather than netserver. Ah, sounds perfect. happy benchmarking, rick jones -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Flow Control and Port Mirroring Revisited
On Mon, Jan 24, 2011 at 11:01:45AM -0800, Rick Jones wrote: Michael S. Tsirkin wrote: On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote: Just to block netperf you can send it SIGSTOP :) Clever :) One could I suppose achieve the same result by making the remote receive socket buffer size smaller than the UDP message size and then not worry about having to learn the netserver's PID to send it the SIGSTOP. I *think* the semantics will be substantially the same? If you could set it, yes. But at least linux ignores any value substantially smaller than 1K, and then multiplies that by 2:

case SO_RCVBUF:
    /* Don't error on this BSD doesn't and if you think
       about it this is right. Otherwise apps have to
       play 'guess the biggest size' games. RCVBUF/SNDBUF
       are treated in BSD as hints */
    if (val > sysctl_rmem_max)
        val = sysctl_rmem_max;
set_rcvbuf:
    sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
    /*
     * We double it on the way in to account for
     * struct sk_buff etc. overhead. Applications
     * assume that the SO_RCVBUF setting they make will
     * allow that much actual data to be received on that
     * socket.
     *
     * Applications are unaware that struct sk_buff and
     * other overheads allocate from the receive buffer
     * during socket buffer allocation.
     *
     * And after considering the possible alternatives,
     * returning the value we actually used in getsockopt
     * is the most desirable behavior.
     */
    if ((val * 2) < SOCK_MIN_RCVBUF)
        sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
    else
        sk->sk_rcvbuf = val * 2;

and

/*
 * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
 * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
 */
#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))

Pity - seems to work back on 2.6.26: Hmm, that code is there at least as far back as 2.6.12.
raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     2882334      0     2361.17
   256           10.00           0               0.00

raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

Still, even with that (or SIGSTOP) we don't really know where the packets were dropped right? There is no guarantee they weren't dropped before they got to the socket buffer

happy benchmarking, rick jones

Right. Better send to a port with no socket listening there, that would drop the packet at an early (if not at the earliest possible) opportunity.

PS - here is with a -S 1024 option:

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1024 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     1679269      0     1375.64
  2048           10.00     1490662             1221.13

showing that there is a decent chance that many of the frames were dropped at the socket buffer, but not all - I suppose I could/should be checking netstat stats...
:) And just a little more, only because I was curious :)

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1M -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928     257   10.00     1869134      0      384.29
262142           10.00     1869134             384.29

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928     257   10.00     3076363      0      632.49
   256           10.00           0                0.00
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 07:41:32AM -0700, Alex Williamson wrote: On Tue, 2011-01-25 at 08:36 +0100, Jan Kiszka wrote: On 2011-01-25 06:37, Alex Williamson wrote: On Mon, 2011-01-24 at 08:44 -0700, Alex Williamson wrote: I'll look at how we might be able to allocate slots on demand. Thanks, Here's a first cut just to see if this looks agreeable. This allows the slot array to grow on demand. This works with current userspace, as well as userspace trivially modified to double KVMState.slots and hotplugging enough pci-assign devices to exceed the previous limit (w/ w/o ept). Hopefully I got the rcu bits correct. Does this look like the right path? If so, I'll work on removing the fixed limit from userspace next. Thanks, Alex kvm: Allow memory slot array to grow on demand Remove fixed KVM_MEMORY_SLOTS limit, allowing the slot array to grow on demand. Private slots are now allocated at the front instead of the end. Only x86 seems to use private slots, Hmm, doesn't current user space expect slots 8..11 to be the private ones and wouldn't it cause troubles if slots 0..3 are suddenly reserved? The private slots aren't currently visible to userspace, they're actually slots 32..35. The patch automatically increments user passed slot ids so userspace has its own zero-based view of the array. Frankly, I don't understand why userspace reserves slots 8..11, is this compatibility with older kernel implementations? so this is now zero for all other archs. The memslots pointer is already updated using rcu, so changing the size of the array when it's replaced is straightforward. x86 also keeps a bitmap of slots used by a kvm_mmu_page, which requires a shadow tlb flush whenever we increase the number of slots. This forces the pages to be rebuilt with the new bitmap size. Is it possible for user space to increase the slot number to ridiculous amounts (at least as far as kmalloc allows) and then trigger a kernel walk through them in non-preemptible contexts?
Just wondering, I haven't checked all contexts of functions like kvm_is_visible_gfn yet. If yes, we should already switch to rbtree or something like that. Otherwise that may wait a bit, but probably not too long. Yeah, Avi has brought up the hole that userspace can exploit this interface with these changes. However, for 99+% of users, this change leaves the slot array at about the same size, or makes it smaller. Only huge, scale-out guests would probably even see a doubling of slots (my guest with 14 82576 VFs uses 48 slots). On the kernel side, I think we can safely save a tree implementation as a later optimization should we determine it's necessary. We'll have to see how the userspace side matches to figure out what's best there. Thanks, Alex Also, is there any time where we need to scan all slots on data path? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
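The grow-on-demand scheme Alex describes can be sketched in userspace C. This is only an illustrative model (the names `memslots` and `grow_memslots` are made up, and the kernel's RCU publication of the new array is reduced to returning a fresh copy):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct memslot { unsigned long base_gfn, npages; };

struct memslots {
    int nslots;
    struct memslot slots[];   /* flexible array member, sized on demand */
};

/* Return a copy of 'old' with room for one more slot.  The kernel patch
 * would publish the new array with rcu_assign_pointer() and free the old
 * one after a grace period; this model just returns the copy. */
static struct memslots *grow_memslots(const struct memslots *old)
{
    int n = old ? old->nslots : 0;
    struct memslots *fresh =
        malloc(sizeof(*fresh) + (n + 1) * sizeof(struct memslot));

    if (!fresh)
        return NULL;
    fresh->nslots = n + 1;
    if (old)
        memcpy(fresh->slots, old->slots, n * sizeof(struct memslot));
    memset(&fresh->slots[n], 0, sizeof(struct memslot));
    return fresh;
}
```

In the kernel the old array may only be freed after readers are done with it, which is what the rcu bits Alex mentions take care of.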
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 12:20:03PM +0200, Avi Kivity wrote: On 01/24/2011 11:32 AM, Marcelo Tosatti wrote: On Fri, Jan 21, 2011 at 04:48:02PM -0700, Alex Williamson wrote: When doing device assignment, we use cpu_register_physical_memory() to directly map the qemu mmap of the device resource into the address space of the guest. The unadvertised feature of the register physical memory code path on kvm, at least for this type of mapping, is that it needs to allocate an index from a small, fixed array of memory slots. Even better, if it can't get an index, the code aborts deep in the kvm specific bits, preventing the caller from having a chance to recover. It's really easy to hit this by hot adding too many assigned devices to a guest (pretty easy to hit with too many devices at instantiation time too, but the abort is slightly more bearable there). I'm assuming it's pretty difficult to make the memory slot array dynamically sized. If that's not the case, please let me know as that would be a much better solution. It's not difficult to either increase the maximum number (defined as 32 now in both qemu and kernel) of static slots, or support dynamic increases, if it turns out to be a performance issue. We can't make it unbounded in the kernel, since a malicious user could start creating an infinite amount of memory slots, pinning unbounded kernel memory. How about keeping the slots in userspace memory, access them with copy from user? If we make the limit much larger, we should start to think about efficiency. Every mmio vmexit is currently a linear scan of the memory slot table, which is efficient at a small number of slots, but not at a large number. We could conceivably encode the no slot information into a bit in the not-present spte. OK, but the slots that Alex here wants to use are presumably mostly not resulting in a pagefault at all, right? Can we split such guys out so they don't slow mmio lookup? Maybe keep *these* in userspace memory.
-- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote: For the other lookups, which we believe will succeed, we can assume the probability of a match is related to the slot size, and sort the slots by page count. Unlikely to be true for assigned device BARs. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 04:58:41PM +0200, Avi Kivity wrote: On 01/25/2011 04:55 PM, Michael S. Tsirkin wrote: We can't make it unbounded in the kernel, since a malicious user could start creating an infinite amount of memory slots, pinning unbounded kernel memory. How about keeping the slots in userspace memory, access them with copy from user? Some of the data is validated by the kernel, so it needs a kernel copy. Other fields are completely internal to the kernel. If we make the limit much larger, we should start to think about efficiency. Every mmio vmexit is currently a linear scan of the memory slot table, which is efficient at a small number of slots, but not at a large number. We could conceivably encode the no slot information into a bit in the not-present spte. OK, but the slots that Alex here wants to use are presumably mostly not resulting in a pagefault at all, right? Can we split such guys out so they don't slow mmio lookup? Maybe keep *these* in userspace memory. The algorithm is: page fault: for each slot: if addr in slot: populate spte return # no slot matches -> mmio. So we have to try out all slots before we find out none of them are needed. I see now. Yes, a flag in spte would help. Changes in slots would then have to update all these flags. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
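The lookup loop Avi spells out ("for each slot: if addr in slot ...") looks roughly like this in C; `find_slot` and the flat array are illustrative, not the kernel's actual kvm_memslots structure:

```c
#include <assert.h>
#include <stddef.h>

struct memslot { unsigned long base_gfn, npages; };

/* The fault path from the thread: walk every slot, and recognize an mmio
 * address only after no slot matched. */
static const struct memslot *find_slot(const struct memslot *slots, int n,
                                       unsigned long gfn)
{
    for (int i = 0; i < n; i++)
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return &slots[i];
    return NULL;   /* no match: treat the access as emulated mmio */
}
```

The point of the discussion is the NULL case: mmio is only discovered after scanning the whole array, which is why a large slot count hurts.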
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote: On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote: For the other lookups, which we believe will succeed, we can assume the probability of a match is related to the slot size, and sort the slots by page count. Unlikely to be true for assigned device BARs. We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and lots of small slots for BARs and such. The vast majority of faults will either touch one of the two largest slots, or will miss all slots. Relatively few will hit the small slots. Not if you are using one of the assigned devices and don't do any faults on memory :) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 07:34:18PM +0200, Avi Kivity wrote: On 01/25/2011 05:23 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 04:58:41PM +0200, Avi Kivity wrote: On 01/25/2011 04:55 PM, Michael S. Tsirkin wrote: We can't make it unbounded in the kernel, since a malicious user could start creating an infinite amount of memory slots, pinning unbounded kernel memory. How about keeping the slots in userspace memory, access them with copy from user? Some of the data is validated by the kernel, so it needs a kernel copy. Other fields are completely internal to the kernel. If we make the limit much larger, we should start to think about efficiency. Every mmio vmexit is currently a linear scan of the memory slot table, which is efficient at a small number of slots, but not at a large number. We could conceivably encode the no slot information into a bit in the not-present spte. OK, but the slots that Alex here wants to use are presumably mostly not resulting in a pagefault at all, right? Can we split such guys out so they don't slow mmio lookup? Maybe keep *these* in userspace memory. The algorithm is: page fault: for each slot: if addr in slot: populate spte return # no slot matches -> mmio. So we have to try out all slots before we find out none of them are needed. I see now. Yes, a flag in spte would help. Changes in slots would then have to update all these flags. That's easy, we drop all sptes. Ah, right. Hmm, doesn't the cpu already have some flag to distinguish mmio sptes? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Wed, Jan 26, 2011 at 11:17:11AM +0200, Avi Kivity wrote: On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote: On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote: For the other lookups, which we believe will succeed, we can assume the probability of a match is related to the slot size, and sort the slots by page count. Unlikely to be true for assigned device BARs. We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and lots of small slots for BARs and such. The vast majority of faults will either touch one of the two largest slots, or will miss all slots. Relatively few will hit the small slots. Not if you are using one of the assigned devices and don't do any faults on memory :) It's impossible not to fault on memory. No, I mean the RAM. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Wed, Jan 26, 2011 at 11:23:21AM +0200, Avi Kivity wrote: On 01/26/2011 11:20 AM, Michael S. Tsirkin wrote: On Wed, Jan 26, 2011 at 11:17:11AM +0200, Avi Kivity wrote: On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote: On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote: For the other lookups, which we believe will succeed, we can assume the probability of a match is related to the slot size, and sort the slots by page count. Unlikely to be true for assigned device BARs. We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and lots of small slots for BARs and such. The vast majority of faults will either touch one of the two largest slots, or will miss all slots. Relatively few will hit the small slots. Not if you are using one of the assigned devices and don't do any faults on memory :) It's impossible not to fault on memory. No, I mean the RAM. No idea what you mean. It's impossible not to fault on RAM, either (unless you don't use it at all). I just mean that once you fault you map sptes and then you can use them without exits. mmio will cause exits each time. Right? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Wed, Jan 26, 2011 at 11:54:21AM +0200, Avi Kivity wrote: On 01/26/2011 11:39 AM, Michael S. Tsirkin wrote: On Wed, Jan 26, 2011 at 11:23:21AM +0200, Avi Kivity wrote: On 01/26/2011 11:20 AM, Michael S. Tsirkin wrote: On Wed, Jan 26, 2011 at 11:17:11AM +0200, Avi Kivity wrote: On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote: On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote: For the other lookups, which we believe will succeed, we can assume the probability of a match is related to the slot size, and sort the slots by page count. Unlikely to be true for assigned device BARs. We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and lots of small slots for BARs and such. The vast majority of faults will either touch one of the two largest slots, or will miss all slots. Relatively few will hit the small slots. Not if you are using one of the assigned devices and don't do any faults on memory :) It's impossible not to fault on memory. No, I mean the RAM. No idea what you mean. It's impossible not to fault on RAM, either (unless you don't use it at all). I just mean that once you fault you map sptes and then you can use them without exits. mmio will cause exits each time. Right? The swapper scanning sptes, ksmd, khugepaged, and swapping can all cause a page to be unmapped. Though it should certainly happen with a much lower frequency than mmio. Right. That's why I say that sorting by size might not be optimal. Maybe a cache ... -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
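Avi's sort-by-page-count proposal is just a descending sort over the slot array; this userspace sketch (`sort_slots` is a made-up name) shows the idea:

```c
#include <assert.h>
#include <stdlib.h>

struct memslot { unsigned long base_gfn, npages; };

/* Descending comparison by npages: biggest slots first. */
static int cmp_npages_desc(const void *a, const void *b)
{
    const struct memslot *x = a, *y = b;

    if (x->npages != y->npages)
        return x->npages < y->npages ? 1 : -1;
    return 0;
}

/* Put the large RAM slots at the front of the scan order; the many
 * small BAR slots end up last. */
static void sort_slots(struct memslot *slots, int n)
{
    qsort(slots, n, sizeof(*slots), cmp_npages_desc);
}
```

This is exactly the bet under debate: it wins when most faults land in the big RAM slots, and loses when the hot lookups (assigned-device BARs, mmio misses) live at the tail.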
Re: Network performance with small packets
On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote: I am working on a KVM network performance issue found in our lab running the DayTrader benchmark. The benchmark throughput takes a significant hit when running the application server in a KVM guest versus on bare metal. We have dug into the problem and found that DayTrader's use of small packets exposes KVM's overhead of handling network packets. I have been able to reproduce the performance hit with a simpler setup using the netperf benchmark with the TCP_RR test and the request and response sizes set to 256 bytes. I run the benchmark between two physical systems, each using a 1GB link. In order to get the maximum throughput for the system I have to run 100 instances of netperf. When I run the netserver processes in a guest, I see a maximum throughput that is 51% of what I get if I run the netserver processes directly on the host. The CPU utilization in the guest is only 85% at maximum throughput, whereas it is 100% on bare metal. The KVM host has 16 CPUs. The KVM guest is configured with 2 VCPUs. When I run netperf on the host I boot the host with maxcpus=2 on the kernel command line. The host is running the current KVM upstream kernel along with the current upstream qemu. Here is the qemu command used to launch the guest: /build/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 -name glasgow-RH60 -m 32768 -drive file=/build/guest-data/glasgow-RH60.img,if=virtio,index=0,boot=on -drive file=/dev/virt/WAS,if=virtio,index=1 -net nic,model=virtio,vlan=3,macaddr=00:1A:64:E5:00:63,netdev=nic0 -netdev tap,id=nic0,vhost=on -smp 2 -vnc :1 -monitor telnet::4499,server,nowait -serial telnet::8899,server,nowait --mem-path /libhugetlbfs -daemonize We have tried various proposed fixes, each with varying amounts of success. One such fix was to add code to the vhost thread such that when it found the work queue empty it wouldn't just exit the thread but rather would delay for 50 microseconds and then recheck the queue.
If there was work on the queue it would loop back and process it, else it would exit the thread. The change got us a 13% improvement in the DayTrader throughput. Running the same netperf configuration on the same hardware but using a different hypervisor gets us significantly better throughput numbers. The guest on that hypervisor runs at 100% CPU utilization. The various fixes we have tried have not gotten us close to the throughput seen on the other hypervisor. I'm looking for ideas/input from the KVM experts on how to make KVM perform better when handling small packets. Thanks, Steve I am seeing a similar problem, and am trying to fix that. My current theory is that this is a variant of a receive livelock: if the application isn't fast enough to process incoming data, the guest net stack switches from prequeue to backlog handling. One thing I noticed is that locking the vhost thread and the vcpu to the same physical CPU almost doubles the bandwidth. Can you confirm that in your setup? My current guess is that when we lock both to a single CPU, netperf in the guest gets scheduled, slowing down the vhost thread in the host. I also noticed that this specific workload performs better with vhost off: presumably we are loading the guest less. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
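The pinning experiment MST mentions (vhost thread and vcpu locked to the same physical CPU) is usually done with taskset, but the underlying call can be sketched directly; this is a minimal Linux-only illustration, not code from either hypervisor:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single CPU.  Returns 0 on success.
 * Applied to both the vhost worker and the vcpu thread with the same
 * 'cpu', this reproduces the experiment described above. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

The equivalent from the shell would be something like `taskset -pc 0 <pid>` for each of the two threads.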
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Thu, Jan 27, 2011 at 11:21:47AM +0200, Avi Kivity wrote: On 01/26/2011 02:08 PM, Michael S. Tsirkin wrote: I just mean that once you fault you map sptes and then you can use them without exits. mmio will cause exits each time. Right? The swapper scanning sptes, ksmd, khugepaged, and swapping can all cause a page to be unmapped. Though it should certainly happen with a much lower frequency than mmio. Right. That's why I say that sorting by size might not be optimal. Maybe a cache ... Why would it not be optimal? If you have 16GB RAM in two slots and a few megabytes here and there scattered in some slots, you have three orders of magnitude to spare. Yes, but you might not be accessing all that RAM. Maybe your working set is just tens of megabytes. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Thu, Jan 27, 2011 at 11:26:19AM +0200, Michael S. Tsirkin wrote: On Thu, Jan 27, 2011 at 11:21:47AM +0200, Avi Kivity wrote: On 01/26/2011 02:08 PM, Michael S. Tsirkin wrote: I just mean that once you fault you map sptes and then you can use them without exits. mmio will cause exits each time. Right? The swapper scanning sptes, ksmd, khugepaged, and swapping can all cause a page to be unmapped. Though it should certainly happen with a much lower frequency than mmio. Right. That's why I say that sorting by size might not be optimal. Maybe a cache ... Why would it not be optimal? If you have 16GB RAM in two slots and a few megabytes here and there scattered in some slots, you have three orders of magnitude to spare. Yes, but you might not be accessing all that RAM. Maybe your working set is just tens of megabytes. Anyway, what I am saying is that this is all just guesswork. Some kind of cache sounds, to me, like a better approach if we have a ton of slots. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
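The "some kind of cache" idea can be as simple as remembering the last slot that matched and probing it before the linear scan; a userspace sketch (names are made up, and a real implementation would need invalidation when slots change):

```c
#include <assert.h>
#include <stddef.h>

struct memslot { unsigned long base_gfn, npages; };

static const struct memslot *cached;   /* last slot that matched */

/* Probe the one-entry cache before falling back to the linear scan. */
static const struct memslot *lookup(const struct memslot *slots, int n,
                                    unsigned long gfn)
{
    if (cached && gfn >= cached->base_gfn &&
        gfn < cached->base_gfn + cached->npages)
        return cached;
    for (int i = 0; i < n; i++)
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return cached = &slots[i];
    return NULL;   /* miss on every slot: emulated mmio */
}
```

Unlike sorting by size, this adapts to whatever slot is actually hot, at the cost of one extra compare on the miss path.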
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Thu, Jan 27, 2011 at 11:28:12AM +0200, Avi Kivity wrote: On 01/27/2011 11:26 AM, Michael S. Tsirkin wrote: Right. That's why I say that sorting by size might not be optimal. Maybe a cache ... Why would it not be optimal? If you have 16GB RAM in two slots and a few megabytes here and there scattered in some slots, you have three orders of magnitude to spare. Yes, but you might not be accessing all that RAM. Maybe your working set is just tens of megabytes. Great, then you fault it in and never touch the slots array again. Yep. Except for emulated mmio :) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio_blk: allow re-reading config space at runtime
On Fri, Jan 14, 2011 at 05:01:37PM +0100, Christoph Hellwig wrote: Wire up the virtio_driver config_changed method to get notified about config changes raised by the host. For now we just re-read the device size to support online resizing of devices, but once we add more attributes that might be changeable they could be added as well. Note that the config_changed method is called from irq context, so we'll have to use the workqueue infrastructure to provide us a proper user context for our changes. Signed-off-by: Christoph Hellwig h...@lst.de

Index: xfs/drivers/block/virtio_blk.c
===
--- xfs.orig/drivers/block/virtio_blk.c	2011-01-13 18:17:23.730254665 +0100
+++ xfs/drivers/block/virtio_blk.c	2011-01-14 16:57:50.572032906 +0100
@@ -6,10 +6,12 @@
 #include <linux/virtio.h>
 #include <linux/virtio_blk.h>
 #include <linux/scatterlist.h>
+#include <linux/string_helpers.h>

 #define PART_BITS 4

 static int major, index;
+struct workqueue_struct *virtblk_wq;

 struct virtio_blk
 {
@@ -42,6 +44,11 @@ struct virtblk_req
 	u8 status;
 };

+struct virtblk_config_change {
+	struct virtio_device *vdev;
+	struct work_struct work;
+};
+
 static void blk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
@@ -291,6 +298,57 @@ static ssize_t virtblk_serial_show(struc
 }
 DEVICE_ATTR(serial, S_IRUGO, virtblk_serial_show, NULL);

+static void virtblk_config_changed_work(struct work_struct *work)
+{
+	struct virtblk_config_change *cfg =
+		container_of(work, struct virtblk_config_change, work);
+	struct virtio_device *vdev = cfg->vdev;
+	struct virtio_blk *vblk = vdev->priv;
+	struct request_queue *q = vblk->disk->queue;
+	char cap_str_2[10], cap_str_10[10];
+	u64 capacity, size;
+
+	/* Host must always specify the capacity. */
+	vdev->config->get(vdev, offsetof(struct virtio_blk_config, capacity),
+			  &capacity, sizeof(capacity));
+
+	/* If capacity is too big, truncate with warning. */
+	if ((sector_t)capacity != capacity) {
+		dev_warn(&vdev->dev, "Capacity %llu too large: truncating\n",
+			 (unsigned long long)capacity);
+		capacity = (sector_t)-1;
+	}
+
+	size = capacity * queue_logical_block_size(q);
+	string_get_size(size, STRING_UNITS_2, cap_str_2, sizeof(cap_str_2));
+	string_get_size(size, STRING_UNITS_10, cap_str_10, sizeof(cap_str_10));
+
+	dev_notice(&vdev->dev,
+		   "new size: %llu %d-byte logical blocks (%s/%s)\n",
+		   (unsigned long long)capacity,
+		   queue_logical_block_size(q),
+		   cap_str_10, cap_str_2);
+
+	set_capacity(vblk->disk, capacity);
+
+}
+
+static void virtblk_config_changed(struct virtio_device *vdev)
+{
+	struct virtblk_config_change *cfg;
+
+	cfg = kmalloc(sizeof(*cfg), GFP_ATOMIC);
+	if (!cfg) {
+		dev_info(&vdev->dev, "skipping config change\n");
+		return;
+	}
+
+	cfg->vdev = vdev;
+	INIT_WORK(&cfg->work, virtblk_config_changed_work);
+	queue_work(virtblk_wq, &cfg->work);

This needs to be flushed on device removal, I think, otherwise the vdev pointer will go stale.

+}
+
+

Two empty lines :)

 static int __devinit virtblk_probe(struct virtio_device *vdev)
 {
 	struct virtio_blk *vblk;
@@ -508,27 +566,47 @@ static unsigned int features[] = {
  * Use __refdata to avoid this warning. */
 static struct virtio_driver __refdata virtio_blk = {
-	.feature_table = features,
-	.feature_table_size = ARRAY_SIZE(features),
-	.driver.name = KBUILD_MODNAME,
-	.driver.owner = THIS_MODULE,
-	.id_table = id_table,
-	.probe = virtblk_probe,
-	.remove = __devexit_p(virtblk_remove),
+	.feature_table		= features,
+	.feature_table_size	= ARRAY_SIZE(features),
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.probe			= virtblk_probe,
+	.remove			= __devexit_p(virtblk_remove),
+	.config_changed		= virtblk_config_changed,
 };

 static int __init init(void)
 {
+	int error;
+
+	virtblk_wq = alloc_workqueue("md_misc", 0, 0);
+	if (!virtblk_wq)
+		return -ENOMEM;
+
 	major = register_blkdev(0, "virtblk");
-	if (major < 0)
-		return major;
-	return register_virtio_driver(&virtio_blk);
+	if (major < 0) {
+		error = major;
+		goto out_destroy_workqueue;
+	}
+
+	error = register_virtio_driver(&virtio_blk);
+	if (error)
+		goto out_unregister_blkdev;
+	return 0;
+
+out_unregister_blkdev:
+	unregister_blkdev(major, "virtblk");
+out_destroy_workqueue:
+
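The capacity truncation in virtblk_config_changed_work() is easy to model on its own: if the 64-bit capacity the host reports does not survive the cast to sector_t, it is clamped to the maximum representable value. This standalone sketch uses a 32-bit stand-in for sector_t (the real type is 32 or 64 bits depending on kernel config):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a kernel config where sector_t is 32 bits wide. */
typedef uint32_t sector32_t;

/* If the 64-bit capacity does not round-trip through sector_t, clamp it
 * to the maximum, as the driver does before logging its warning. */
static uint64_t clamp_capacity(uint64_t capacity)
{
    if ((sector32_t)capacity != capacity)
        return (sector32_t)-1;
    return capacity;
}
```

The `(sector_t)-1` idiom yields the all-ones value of the narrower type, so the disk is sized to the largest addressable sector rather than to a wrapped, bogus value.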
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 10:44:34AM -0800, Shirley Ma wrote: On Wed, 2011-01-26 at 17:17 +0200, Michael S. Tsirkin wrote: I am seeing a similar problem, and am trying to fix that. My current theory is that this is a variant of a receive livelock: if the application isn't fast enough to process incoming data, the guest net stack switches from prequeue to backlog handling. One thing I noticed is that locking the vhost thread and the vcpu to the same physical CPU almost doubles the bandwidth. Can you confirm that in your setup? My current guess is that when we lock both to a single CPU, netperf in the guest gets scheduled, slowing down the vhost thread in the host. I also noticed that this specific workload performs better with vhost off: presumably we are loading the guest less. I found a similar issue for the small message size TCP_STREAM test with the guest as TX. I found that when I slow down TX, the BW performance is doubled for 1K to 4K message sizes. Shirley Interesting. In particular, running vhost and the transmitting guest on the same host would have the effect of slowing down TX. Does it double the BW for you too? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 11:09:00AM -0800, Shirley Ma wrote: On Thu, 2011-01-27 at 21:00 +0200, Michael S. Tsirkin wrote: Interesting. In particular, running vhost and the transmitting guest on the same host would have the effect of slowing down TX. Does it double the BW for you too? Running vhost and the TX guest on the same host seems not good enough to slow down TX. In order to gain double or even triple BW for guest TX to the local host I still need to play around; for 1K message size, BW is able to increase from 2.XGb/s to 6.XGb/s. Thanks Shirley Well, slowing down the guest does not sound hard - for example we can request guest notifications, or send extra interrupts :) A slightly more sophisticated thing to try is to poll the vq a bit more aggressively. For example, if we handled some requests and now the tx vq is empty, reschedule and yield. Worth a try? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 11:45:47AM -0800, Shirley Ma wrote: On Thu, 2011-01-27 at 21:31 +0200, Michael S. Tsirkin wrote: Well, slowing down the guest does not sound hard - for example we can request guest notifications, or send extra interrupts :) A slightly more sophisticated thing to try is to poll the vq a bit more aggressively. For example, if we handled some requests and now the tx vq is empty, reschedule and yield. Worth a try? I used dropping packets at a high level to slow down TX. I am still thinking about what's the right approach here. Interesting. Could this be a variant of the now famous bufferbloat then? I guess we could drop some packets if we see we are not keeping up. For example, if we see that the ring is X% full, we could quickly complete Y% without transmitting packets on. Or maybe we should drop some bytes, not packets. Requesting guest notification and extra interrupts is what we want to avoid to reduce VM exits for saving CPUs. I don't think it's good. Yes, but how do you explain the regression? One simple theory is that the guest net stack became faster and so the host can't keep up. By polling the vq a bit more aggressively, you meant vhost, right? Shirley Yes. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote: On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote: Interesting. Could this be a variant of the now famous bufferbloat then? Sigh, bufferbloat is the new global warming... :-/ Yep, some places become colder, some other places become warmer; same as BW results, sometimes faster, sometimes slower. :) Shirley

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value etc. Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000

+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;

 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n",
 				 err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;
+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced < tx_packets_coalesce ||
+		   bytes_coalesced < tx_bytes_coalesce ||
+		   bufs_coalesced < tx_bufs_coalesce))
+		vhost_signal(&net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
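The signalling policy in the patch above, stripped of the vhost machinery, is just a threshold test on three per-batch counters; a userspace model (`tx_should_signal` and `struct coalesce` are made-up names):

```c
#include <assert.h>
#include <stdbool.h>

/* Per-batch counters and their thresholds (the module parameters in the
 * patch).  With thresholds of 0 every packet signals, matching the old
 * behaviour; huge thresholds defer the interrupt to the end of the
 * batch, i.e. until the ring is seen empty. */
struct coalesce {
    int packets, bytes, bufs;             /* accumulated this batch */
    int max_packets, max_bytes, max_bufs; /* signal once any is exceeded */
};

/* Account one transmitted packet and report whether the guest should be
 * interrupted now rather than at batch end. */
static bool tx_should_signal(struct coalesce *c, int len, int nbufs)
{
    c->packets += 1;
    c->bytes += len;
    c->bufs += nbufs;
    return c->packets > c->max_packets ||
           c->bytes > c->max_bytes ||
           c->bufs > c->max_bufs;
}
```

This captures why the patch "by itself does nothing": with all thresholds at 0 the predicate is true for the very first packet, so the signal-per-packet behaviour is unchanged until the parameters are raised.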
[PATCHv2] vhost: force vhost off for non-MSI guests
When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd; for now, disable vhost-net in these configurations. Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Untested, I'll send a pull request later after some testing, assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1: added vhostforce to enable vhost for non-MSI guests

 hw/vhost.c      | 16 +++-
 hw/vhost.h      |  3 ++-
 hw/vhost_net.c  |  8 +---
 hw/vhost_net.h  |  2 +-
 hw/virtio-net.c |  6 --
 hw/virtio-pci.c |  6 +-
 hw/virtio.h     |  4 +++-
 net/tap.c       |  6 --
 qemu-options.hx |  4 +++-
 9 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 6082da2..72df75f 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -581,7 +581,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                               0, virtio_queue_get_desc_size(vdev, idx));
 }

-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force)
 {
     uint64_t features;
     int r;
@@ -613,6 +613,7 @@ int vhost_dev_init(struct vhost_dev *hdev, int devfd)
     hdev->log_enabled = false;
     hdev->started = false;
     cpu_register_phys_memory_client(&hdev->client);
+    hdev->force = force;
     return 0;
 fail:
     r = -errno;
@@ -636,9 +637,12 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         goto fail;
     }

-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true,
+                                           hdev->force);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }

@@ -686,7 +690,8 @@ fail_vq:
     }
 fail_mem:
 fail_features:
-    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                       hdev->force);
 fail_notifiers:
 fail:
     return r;
@@ -704,7 +709,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     }
     vhost_client_sync_dirty_bitmap(&hdev->client, 0,
                                    (target_phys_addr_t)~0x0ull);
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                           hdev->force);
     if (r < 0) {
         fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
         fflush(stderr);
diff --git a/hw/vhost.h b/hw/vhost.h
index 86dd834..d5d4d86 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -38,9 +38,10 @@ struct vhost_dev {
     bool log_enabled;
     vhost_log_chunk_t *log;
     unsigned long long log_size;
+    bool force;
 };

-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index c068be1..4f57271 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -81,7 +81,8 @@ static int vhost_net_get_fd(VLANClientState *backend)
     }
 }

-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -98,7 +99,7 @@ struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;

-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, force);
     if (r < 0) {
         goto fail;
     }
@@ -188,7 +189,8 @@ void vhost_net_cleanup(struct vhost_net *net)
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     return NULL;
 }
diff --git a/hw/vhost_net.h b/hw/vhost_net.h
index 6c18ff7..8435094 100644
--- a/hw/vhost_net.h
+++ b/hw/vhost_net.h
@@ -6,7 +6,7 @@ struct vhost_net;
 typedef struct vhost_net VHostNetState;

-VHostNetState *vhost_net_init(VLANClientState *backend, int devfd);
+VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, bool force
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Mon, Jan 31, 2011 at 02:47:34PM -0700, Alex Williamson wrote:
On Mon, 2011-01-31 at 23:19 +0200, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd; for now, disable vhost-net in these configurations. Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Untested, I'll send a pull request later after some testing, assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1: added vhostforce to enable vhost for non-MSI guests

I'm still not thrilled with the errno hack. If this is really a generic problem, we should better architect the set_guest_notifiers() callback; the force option feels like a band-aid. I'm still curious if we can't be more dynamic about how we set up the set_guest_notifiers() function pointer so it only gets set if (force || msix_enabled()), and we can limit some of the impact here to vhost. Thanks, Alex

I'm not entirely happy with the API either, but I'm sure playing with the function pointer is not the solution we are looking for: msix is enabled/disabled dynamically. force is static, but it is something the net backend knows about; virtio-pci doesn't. So I suspect modifying the callback on the fly will become messy quite quickly.

Well, this API will be easy to tweak after the fact; there's a single user after all.
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Mon, Jan 31, 2011 at 03:07:49PM -0700, Alex Williamson wrote:
On Tue, 2011-02-01 at 00:02 +0200, Michael S. Tsirkin wrote:
On Mon, Jan 31, 2011 at 02:47:34PM -0700, Alex Williamson wrote:
On Mon, 2011-01-31 at 23:19 +0200, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd; for now, disable vhost-net in these configurations. Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Untested, I'll send a pull request later after some testing, assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1: added vhostforce to enable vhost for non-MSI guests

I'm still not thrilled with the errno hack. If this is really a generic problem, we should better architect the set_guest_notifiers() callback; the force option feels like a band-aid. I'm still curious if we can't be more dynamic about how we set up the set_guest_notifiers() function pointer so it only gets set if (force || msix_enabled()), and we can limit some of the impact here to vhost. Thanks, Alex

I'm not entirely happy with the API either, but I'm sure playing with the function pointer is not the solution we are looking for: msix is enabled/disabled dynamically. force is static, but it is something the net backend knows about; virtio-pci doesn't. So I suspect modifying the callback on the fly will become messy quite quickly.

Isn't there a callback to the device when msix is enabled/disabled?

Not at the moment. In qemu we have mask/unmask.

Well, this API will be easy to tweak after the fact; there's a single user after all.

Gosh, I wish I got to take that attitude when I submit code...
;) Alex

You are probably dealing with more important issues :) If you have an internal API used in a single place, you get to change it at will. If you tweak an API used all over the codebase, you don't :)
Re: Network performance with small packets
On Mon, Jan 31, 2011 at 06:24:34PM -0600, Steve Dobbelstein wrote:
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value, etc. Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x8
+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;

 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

+		bufs_coalesced += out;

Correct.

Perusing the code I see that earlier there is a check to see if in is not zero and, if so, error out of the loop. After the check, in is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below. Or am I missing something?

+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(&net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }

Steve D.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Mon, Jan 31, 2011 at 05:30:38PM -0800, Sridhar Samudrala wrote:
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value, etc. Warning: completely untested.

@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

+		bufs_coalesced += out;

Perusing the code I see that earlier there is a check to see if in is not zero and, if so, error out of the loop. After the check, in is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below.

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for the small message guest-to-host stream test, and the throughput went up a little.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..5f9fae9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		vhost_add_used(vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
@@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
+	if (total_len > 0)
+		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }

Or am I missing something?

+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(&net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }

It is possible that we can miss signaling the guest even after processing a few pkts, if we don't hit any of these conditions.

Yes. It really should be:

	if (likely(packets_coalesced && bytes_coalesced && bufs_coalesced))
		vhost_signal(&net->dev, vq);

Steve D.
[PULL] vhost-net: 2.6.38 - warning fix
Please merge for 2.6.38. Not the most elegant fix, but it does fix the noise in dmesg that interferes with kvm debugging, and probably the best we can do for 2.6.38.

The following changes since commit fca540ab5f4718c6133f71f7be1793066008bf89:

  enc28j60: Fix reading of transmit status vector (2011-01-31 20:56:54 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net

Michael S. Tsirkin (1):
      vhost: rcu annotation fixup

 drivers/vhost/net.c   | 9 +
 drivers/vhost/vhost.h | 6 +++---
 2 files changed, 8 insertions(+), 7 deletions(-)
[PATCHv2 dontapply] vhost-net tx tuning
OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By default with it we will only interrupt when we see an empty ring. Which is likely too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value like half ring * 200, or packets some small value, etc. Set any one parameter to 0 to get current behaviour (interrupt immediately when enabled). Warning: completely untested.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x8
+int tx_bytes_coalesce = 10;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 10;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 10;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;

 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += out;
+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced &&
+		   bytes_coalesced &&
+		   bufs_coalesced))
+		vhost_signal(&net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }
[PATCHv3 0.14] vhost: force vhost off for non-MSI guests
When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd; for now, disable vhost-net in these configurations. Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

OK this is 0.14 material so quick review would be appreciated. This version's compiled only, I'll naturally test more before I queue it.

Changes from v2: get rid of EVIRTIO, add a separate API to query for fast guest notifiers
Changes from v1: add vhostforce flag

 hw/vhost.c      | 10 +-
 hw/vhost.h      |  4 +++-
 hw/vhost_net.c  | 24 ++--
 hw/vhost_net.h  |  3 ++-
 hw/virtio-net.c |  6 +-
 hw/virtio-pci.c |  7 +++
 hw/virtio.h     |  1 +
 net/tap.c       |  6 --
 qemu-options.hx |  4 +++-
 9 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 6082da2..38cc3b3 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -581,7 +581,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                               0, virtio_queue_get_desc_size(vdev, idx));
 }

-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force)
 {
     uint64_t features;
     int r;
@@ -613,6 +613,7 @@ int vhost_dev_init(struct vhost_dev *hdev, int devfd)
     hdev->log_enabled = false;
     hdev->started = false;
     cpu_register_phys_memory_client(&hdev->client);
+    hdev->force = force;
     return 0;
 fail:
     r = -errno;
@@ -627,6 +628,13 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
     close(hdev->control);
 }

+bool vhost_dev_query(struct vhost_dev *hdev, VirtIODevice *vdev)
+{
+    return !vdev->binding->query_guest_notifiers ||
+        vdev->binding->query_guest_notifiers(vdev->binding_opaque) ||
+        hdev->force;
+}
+
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
 {
     int i, r;
diff --git a/hw/vhost.h b/hw/vhost.h
index 86dd834..c8c595a 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -38,10 +38,12 @@ struct vhost_dev {
     bool log_enabled;
     vhost_log_chunk_t *log;
     unsigned long long log_size;
+    bool force;
 };

-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
+bool vhost_dev_query(struct vhost_dev *hdev, VirtIODevice *vdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index c068be1..420e05f 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -81,7 +81,8 @@ static int vhost_net_get_fd(VLANClientState *backend)
     }
 }

-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -98,7 +99,7 @@ struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;

-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, force);
     if (r < 0) {
         goto fail;
     }
@@ -121,6 +122,11 @@ fail:
     return NULL;
 }

+bool vhost_net_query(VHostNetState *net, VirtIODevice *dev)
+{
+    return vhost_dev_query(&net->dev, dev);
+}
+
 int vhost_net_start(struct vhost_net *net, VirtIODevice *dev)
 {
@@ -188,15 +194,21 @@ void vhost_net_cleanup(struct vhost_net *net)
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
+{
+    return NULL;
+}
+
+bool vhost_net_query(VHostNetState *net, VirtIODevice *dev)
 {
-	return NULL;
+    return false;
 }

 int vhost_net_start(struct vhost_net *net, VirtIODevice *dev)
 {
-	return -ENOSYS;
+    return -ENOSYS;
 }

 void vhost_net_stop(struct vhost_net *net, VirtIODevice *dev)
@@ -209,7 +221,7 @@ void vhost_net_cleanup(struct vhost_net *net)
 unsigned vhost_net_get_features(struct vhost_net *net, unsigned features)
 {
-	return features;
+    return features;
 }

 void vhost_net_ack_features(struct vhost_net *net, unsigned features)
 {
diff --git a/hw/vhost_net.h b/hw/vhost_net.h
index 6c18ff7..91e40b1 100644
--- a/hw/vhost_net.h
+++ b/hw/vhost_net.h
@@ -6,8 +6,9 @@ struct vhost_net;
 typedef struct vhost_net VHostNetState;

-VHostNetState *vhost_net_init(VLANClientState *backend, int devfd);
+VHostNetState *vhost_net_init(VLANClientState
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 12:25:08PM -0800, Shirley Ma wrote:
On Tue, 2011-02-01 at 22:17 +0200, Michael S. Tsirkin wrote:
On Tue, Feb 01, 2011 at 12:09:03PM -0800, Shirley Ma wrote:
On Tue, 2011-02-01 at 19:23 +0200, Michael S. Tsirkin wrote:
On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote:
On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote:

Interesting. Could this be a variant of the now famous bufferbloat then?

Sigh, bufferbloat is the new global warming... :-/

Yep, some places become colder, some other places become warmer; same as BW results, sometimes faster, sometimes slower. :)
Shirley

Sent a tuning patch (v2) that might help. Could you try it and play with the module parameters please?

Hello Michael,

Sure I will play with this patch to see how it could help. I am looking at the guest side as well; I found a couple of issues on the guest side:

1. free_old_xmit_skbs() should return the number of skbs instead of the total of sgs, since we are using ring size to stop/start the netif queue.

static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
{
	struct sk_buff *skb;
	unsigned int len, tot_sgs = 0;

	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
		pr_debug("Sent skb %p\n", skb);
		vi->dev->stats.tx_bytes += skb->len;
		vi->dev->stats.tx_packets++;
		tot_sgs += skb_vnet_hdr(skb)->num_sg;
		dev_kfree_skb_any(skb);
	}
	return tot_sgs;  /* should return number of skbs to track ring usage here, I think */
}

Did the old guest use number of buffers to track ring usage before?

2. In start_xmit, I think we should move capacity += free_old_xmit_skbs() before netif_stop_queue(), so we avoid unnecessary netif queue stop/start. This condition is heavily hit for small message sizes. Also, shouldn't the capacity checking condition change to something like half of the vring.num size, instead of comparing 2+MAX_SKB_FRAGS?

	if (capacity < 2+MAX_SKB_FRAGS) {
		netif_stop_queue(dev);
		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
			/* More just got used, free them then recheck. */
			capacity += free_old_xmit_skbs(vi);
			if (capacity >= 2+MAX_SKB_FRAGS) {
				netif_start_queue(dev);
				virtqueue_disable_cb(vi->svq);
			}
		}
	}

3. Looks like the xmit callback is only used to wake the queue when the queue has stopped, right? Should we put a condition check here?

static void skb_xmit_done(struct virtqueue *svq)
{
	struct virtnet_info *vi = svq->vdev->priv;

	/* Suppress further interrupts. */
	virtqueue_disable_cb(svq);

	/* We were probably waiting for more output buffers. */
--->	if (netif_queue_stopped(vi->dev))
		netif_wake_queue(vi->dev);
}

Shirley

Well, the return value is used to calculate capacity and that counts the # of s/g. No?

Nope, the current guest kernel uses descriptors, not number of sgs.

Confused. We compare capacity to skb frags, no? That's sg I think ... not sure about the old guest. From a cache utilization POV it might be better to read from the skb and not peek at the virtio header though...

Pls Cc the lists on any discussions in the future.

-- MST

Sorry I missed reply all. :(
Shirley
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 01:09:45PM -0800, Shirley Ma wrote:
On Mon, 2011-01-31 at 17:30 -0800, Sridhar Samudrala wrote:

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path. I tried a simpler version of this patch without any tunables by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for the small message guest-to-host stream test, and the throughput went up a little.

Reducing the signaling will reduce the CPU utilization by reducing VM exits. The small message BW is a problem we have seen with a faster guest / slow vhost; even when I increased VHOST_NET_WEIGHT several times, it didn't help that much for BW. For large message sizes, vhost is able to process all packets on time. I played around with guest/host code; so far I only see huge BW improvement by dropping packets on the guest side.

Thanks
Shirley

My theory is that the issue is not signalling. Rather, our queue fills up, then host handles one packet and sends an interrupt, and we immediately wake the queue. So the vq, once it gets full, stays full.

If you try my patch with the bufs threshold set to e.g. half the vq, what we will do is send an interrupt after we have processed half the vq. So the host has half the vq to go, and the guest has half the vq to fill. See?
-- MST
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:

Confused. We compare capacity to skb frags, no? That's sg I think ...

Current guest kernel uses indirect buffers; num_free returns how many available descriptors, not skb frags. So it's wrong here.
Shirley

I see. Good point. In other words, when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect, so we consume more descriptors. And then we start the queue and the add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect?

Another failure mode is when skb_xmit_done wakes the queue: it might be too early, there might not be space for the next packet in the vq yet.

A solution might be to keep some kind of pool around for indirect; we wanted to do it for block anyway ...

-- MST
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote: My theory is that the issue is not signalling. Rather, our queue fills up, then host handles one packet and sends an interrupt, and we immediately wake the queue. So the vq once it gets full, stays full. From the printk debugging output, it might not be exactly the case. The ring gets full, run a bit, then gets full, then run a bit, then full... Yes, but does it get even half empty in between? If you try my patch with bufs threshold set to e.g. half the vq, what we will do is send interrupt after we have processed half the vq. So host has half the vq to go, and guest has half the vq to fill. See? I am cleaning up my set up to run your patch ... Shirley
Re: [PATCHv2 dontapply] vhost-net tx tuning
On Tue, Feb 01, 2011 at 03:07:38PM -0800, Sridhar Samudrala wrote: On Tue, 2011-02-01 at 17:52 +0200, Michael S. Tsirkin wrote: OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By default with it we will only interrupt when we see an empty ring. Which is likely too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value like half ring * 200, or packets some small value, etc. Set any one parameter to 0 to get current behaviour (interrupt immediately when enabled). Warning: completely untested. Signed-off-by: Michael S. Tsirkin m...@redhat.com ---

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
+int tx_bytes_coalesce = 10;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 10;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 10;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;
 	/* TODO: check that we are running from vhost_worker?
 */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += out;
+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);

I think the counters that exceed the limits need to be reset to 0 here. Otherwise we keep signaling for every buffer once we hit this condition. Thanks Sridhar

Correct, good catch.

+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced + bytes_coalesced + bufs_coalesced))
+		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 02:59:57PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote: There are flags for bytes, buffers and packets. Try playing with any one of them :) Just be sure to use v2. I would like to change it to half of the ring size instead for signaling. Is that OK? Shirley Sure, that is why I made it a parameter so you can experiment. The initial test results show that CPU utilization has been reduced some, and BW has increased some with the default parameters: for 1K message size, BW goes from 2.5Gb/s to about 2.8Gb/s, and CPU utilization drops from 4x% to 38% (similar results to the patch I submitted a while ago to reduce signaling on vhost), but far away from the dropping-packet results. I am going to change the code to use 1/2 ring size to wake the netif queue. Shirley Just tweak the parameters with sysfs, you do not have to edit the code: echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce Or in a similar way for tx_packets_coalesce (since we use indirect, packets will typically use 1 buffer each). -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? Another failure mode is when skb_xmit_done wakes the queue: it might be too early, there might not be space for the next packet in the vq yet. I am not sure if this is the problem - shouldn't you see these messages:

if (likely(capacity == -ENOMEM)) {
	dev_warn(&dev->dev, "TX queue failure: out of memory\n");
} else {
	dev->stats.tx_fifo_errors++;
	dev_warn(&dev->dev, "Unexpected TX queue failure: %d\n", capacity);
}

in the next xmit? I am not getting this in my testing. Yes, I don't think we hit this in our testing, simply because we don't stress memory. Disable indirect, then you might see this. A solution might be to keep some kind of pool around for indirect, we wanted to do it for block anyway ... Your vhost patch should fix this automatically. Right? Reduce the chance of it happening, yes. Thanks, - KK
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote: The way I am changing is only when netif queue has stopped, then we start to count num_free descriptors to send the signal to wake netif queue. I forgot to mention, the code change I am making is in the guest kernel: in the xmit callback, only wake up the queue when it's stopped and num_free >= 1/2 * vq->num. I add a new API in virtio_ring. Interesting. Yes, I agree an API extension would be helpful. However, wouldn't just the signaling reduction be enough, without guest changes? However vhost signaling reduction is needed as well. The patch I submitted a while ago showed both CPU and BW improvement. Thanks Shirley Which patch was that? -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote: Yes, I think doing this in the host is much simpler, just send an interrupt after there's a decent amount of space in the queue. Having said that, the simple heuristic that I coded might be a bit too simple. From the debugging output what I have seen so far (a single small message TCP_STREAM test), I think the right approach is to patch both guest and vhost. One problem is slowing down the guest helps here. So there's a chance that just by adding complexity in guest driver we get a small improvement :( We can't rely on a patched guest anyway, so I think it is best to test guest and host changes separately. And I do agree something needs to be done in guest too, for example when vqs share an interrupt, we might invoke a callback when we see vq is not empty even though it's not requested. Probably should check interrupts enabled here? The problem I have found is a regression for the single small message TCP_STREAM test. The old kernel works well for TCP_STREAM; only the new kernel has a problem. Likely new kernel is faster :) For Steven's problem, it's multiple stream TCP_RR issues; the old guest doesn't perform well, and neither does the new guest kernel. We tested the patch reducing vhost signaling before; it didn't help the performance at all. Thanks Shirley Yes, it seems unrelated to tx interrupts. -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote: On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote: w/i guest change, I played around with the parameters, for example: I could get 3.7Gb/s with 42% CPU (BW increasing from 2.5Gb/s) for 1K message size; w/i dropping packets, I was able to get up to 6.2Gb/s with similar CPU usage. I meant w/o guest change, only vhost changes. Sorry about that. Shirley Ah, excellent. What were the parameters? I used half of the ring size, 129, for the packet counter, but the performance is still not as good as dropping packets on the guest: 3.7Gb/s vs. 6.2Gb/s. Shirley And this is with sndbuf=0 in the host, yes? And do you see a lot of tx interrupts? How many packets per interrupt? -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 09:10:35AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote: Yes, I think doing this in the host is much simpler, just send an interrupt after there's a decent amount of space in the queue. Having said that, the simple heuristic that I coded might be a bit too simple. From the debugging output what I have seen so far (a single small message TCP_STREAM test), I think the right approach is to patch both guest and vhost. One problem is slowing down the guest helps here. So there's a chance that just by adding complexity in guest driver we get a small improvement :( We can't rely on a patched guest anyway, so I think it is best to test guest and host changes separately. And I do agree something needs to be done in guest too, for example when vqs share an interrupt, we might invoke a callback when we see vq is not empty even though it's not requested. Probably should check interrupts enabled here? Yes, I modified the xmit callback to something like below:

static void skb_xmit_done(struct virtqueue *svq)
{
	struct virtnet_info *vi = svq->vdev->priv;

	/* Suppress further interrupts. */
	virtqueue_disable_cb(svq);

	/* We were probably waiting for more output buffers. */
	if (netif_queue_stopped(vi->dev)) {
		free_old_xmit_skbs(vi);
		if (virtqueue_free_size(svq) <= svq->vring.num / 2) {
			virtqueue_enable_cb(svq);
			return;
		}
	}
	netif_wake_queue(vi->dev);
}

OK, but this should have no effect with a vhost patch, which should ensure that we don't get an interrupt until the queue is at least half empty. Right? The problem I have found is a regression for the single small message TCP_STREAM test. The old kernel works well for TCP_STREAM; only the new kernel has a problem. Likely new kernel is faster :) For Steven's problem, it's multiple stream TCP_RR issues; the old guest doesn't perform well, and neither does the new guest kernel.
We tested reducing vhost signaling patch before, it didn't help the performance at all. Thanks Shirley Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. Could be. Why do you think so? Do you have anything in mind on how to reduce vhost latency? Thanks Shirley Hmm, bypassing the bridge might help a bit. Are you using tap+bridge or macvtap?
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote: On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote: w/i guest change, I played around the parameters,for example: I could get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message size, w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU usage. I meant w/o guest change, only vhost changes. Sorry about that. Shirley Ah, excellent. What were the parameters? I used half of the ring size 129 for packet counters, but the performance is still not as good as dropping packets on guest, 3.7 Gb/s vs. 6.2Gb/s. Shirley How many packets and bytes per interrupt are sent? Also, what about other values for the counters and other counters? What does your patch do? Just drop packets instead of stopping the interface? To have an understanding when should we drop packets in the guest, we need to know *why* does it help. Otherwise, how do we know it will work for others? Note that qdisc will drop packets when it overruns - so what is different? Also, are we over-running some other queue somewhere? -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote: OK, but this should have no effect with a vhost patch which should ensure that we don't get an interrupt until the queue is at least half empty. Right? There should be some coordination between guest and vhost. What kind of coordination? With a patched vhost and a full ring, you should get an interrupt per 100 packets. Is this what you see? And if yes, isn't the guest patch doing nothing then? We shouldn't count the TX packets when the netif queue is enabled, since the next guest TX xmit will free any used buffers in vhost. We need to be careful here in case we miss the interrupts when the netif queue has stopped. However, we can't change old guests, so we can test the patches separately for guest only, vhost only, and the combination. Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. Could be. Why do you think so? Since I played with a latency hack, I can see performance differences for different latencies. Which hack was that? Do you have anything in mind on how to reduce vhost latency? Thanks Shirley Hmm, bypassing the bridge might help a bit. Are you using tap+bridge or macvtap? I am using tap+bridge for the TCP_RR test. I think Steven tested macvtap before; he might have some data from his workload performance measurement. Shirley
Re: Network performance with small packets
On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote: I am working on a KVM network performance issue found in our lab running the DayTrader benchmark. The benchmark throughput takes a significant hit when running the application server in a KVM guest versus on bare metal. We have dug into the problem and found that DayTrader's use of small packets exposes KVM's overhead of handling network packets. I have been able to reproduce the performance hit with a simpler setup using the netperf benchmark with the TCP_RR test and the request and response sizes set to 256 bytes. I run the benchmark between two physical systems, each using a 1Gb link. In order to get the maximum throughput for the system I have to run 100 instances of netperf. When I run the netserver processes in a guest, I see a maximum throughput that is 51% of what I get if I run the netserver processes directly on the host. The CPU utilization in the guest is only 85% at maximum throughput, whereas it is 100% on bare metal. You are stressing the scheduler pretty hard with this test :) Is your real benchmark also using a huge number of threads? If it's not, you might be seeing a different issue. IOW, the netperf degradation might not be network-related at all, but have to do with the speed of context switching in the guest. Thoughts? The KVM host has 16 CPUs. The KVM guest is configured with 2 VCPUs. When I run netperf on the host I boot the host with maxcpus=2 on the kernel command line. The host is running the current KVM upstream kernel along with the current upstream qemu.
Here is the qemu command used to launch the guest: /build/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 -name glasgow-RH60 -m 32768 -drive file=/build/guest-data/glasgow-RH60.img,if=virtio,index=0,boot=on -drive file=/dev/virt/WAS,if=virtio,index=1 -net nic,model=virtio,vlan=3,macaddr=00:1A:64:E5:00:63,netdev=nic0 -netdev tap,id=nic0,vhost=on -smp 2 -vnc :1 -monitor telnet::4499,server,nowait -serial telnet::8899,server,nowait --mem-path /libhugetlbfs -daemonize We have tried various proposed fixes, each with varying amounts of success. One such fix was to add code to the vhost thread such that when it found the work queue empty it wouldn't just exit the thread but rather would delay for 50 microseconds and then recheck the queue. If there was work on the queue it would loop back and process it, else it would exit the thread. The change got us a 13% improvement in the DayTrader throughput. Running the same netperf configuration on the same hardware but using a different hypervisor gets us significantly better throughput numbers. The guest on that hypervisor runs at 100% CPU utilization. The various fixes we have tried have not gotten us close to the throughput seen on the other hypervisor. I'm looking for ideas/input from the KVM experts on how to make KVM perform better when handling small packets. Thanks, Steve
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 11:29:35AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote: OK, but this should have no effect with a vhost patch which should ensure that we don't get an interrupt until the queue is at least half empty. Right? There should be some coordination between guest and vhost. What kind of coordination? With a patched vhost and a full ring, you should get an interrupt per 100 packets. Is this what you see? And if yes, isn't the guest patch doing nothing then? vhost_signal won't be able to send any TX interrupts to the guest when the guest TX interrupt is disabled. The guest TX interrupt is only enabled when running out of descriptors. Well, this is also the only case where the queue is stopped, no? We shouldn't count the TX packets when the netif queue is enabled, since the next guest TX xmit will free any used buffers in vhost. We need to be careful here in case we miss the interrupts when the netif queue has stopped. However, we can't change old guests, so we can test the patches separately for guest only, vhost only, and the combination. Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. Could be. Why do you think so? Since I played with a latency hack, I can see performance differences for different latencies. Which hack was that? I tried to accumulate multiple guest to host notifications for TX xmits; it did help multiple-stream TCP_RR results. I don't see a point to delay the used idx update, do you? So delaying just the signal seems better, right? I also forced vhost handle_tx to handle more packets; both hacks seemed to help. Thanks Shirley Haven't noticed that part, how does your patch make it handle more packets?
-- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 01:03:05PM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote: Well, this is also the only case where the queue is stopped, no? Yes. I got some debugging data; I saw that sometimes there were so many packets waiting to be freed in the guest between vhost_signal and the guest xmit callback. What does this mean? Looks like the time spent from vhost_signal to the guest xmit callback is too long? I tried to accumulate multiple guest to host notifications for TX xmits; it did help multiple-stream TCP_RR results. I don't see a point to delay the used idx update, do you? It might cause per vhost handle_tx processed more packets. I don't understand. It's a couple of writes - what is the issue? So delaying just the signal seems better, right? I think I need to define the test matrix to collect data for TX xmit from guest to host here for different tests.

Data to be collected:
1. kvm_stat for VM, I/O exits
2. cpu utilization for both guest and host
3. cat /proc/interrupts on guest
4. packet rate from vhost handle_tx per loop
5. guest netif queue stop rate
6. how many packets are waiting for free between vhost signaling and guest callback
7. performance results

Tests:
1. TCP_STREAM single stream test for 1K to 4K message size
2. TCP_RR (64 instance test): 128 - 1K request/response size

Different hacks:
1. Base line data (with the patch to fix the capacity check first; free_old_xmit_skbs returns number of skbs)
2. Drop packet data (will put some debugging in generic networking code)
3. Delay guest netif queue wake up until certain descriptors (1/2 ring size, 1/4 ring size...) are available once the queue has stopped.
4. Accumulate more packets per vhost signal in handle_tx?
5. 3 & 4 combinations
6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer?
7. Accumulate more packets per vhost handle_tx() by adding some delay?

Haven't noticed that part, how does your patch make it handle more packets? Added a delay in handle_tx().
What else? It would take some time to do this. Shirley Need to think about this.
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 01:41:33PM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote: On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote: Well, this is also the only case where the queue is stopped, no? Yes. I got some debugging data; I saw that sometimes there were so many packets waiting to be freed in the guest between vhost_signal and the guest xmit callback. What does this mean? Let's look at the sequence here:

guest start_xmit()
	xmit_skb()
	if ring is full, enable_cb()

guest skb_xmit_done()
	disable_cb,
	printk free_old_xmit_skbs -- it was between more than 1/2 to full ring size
	printk vq->num_free

vhost handle_tx()
	if (guest interrupt is enabled) signal guest to free xmit buffers

So between the guest queue getting full/stopping the queue/enabling the callback and the guest receiving the callback from the host to free_old_xmit_skbs, there were about 1/2 to the full ring size of descriptors available. I thought there were only a few. (I disabled your vhost patch for this test.) The expected number is vq->num - max skb frags - 2. Looks like the time spent from vhost_signal to the guest xmit callback is too long? I tried to accumulate multiple guest to host notifications for TX xmits; it did help multiple-stream TCP_RR results. I don't see a point to delay the used idx update, do you? It might cause per vhost handle_tx processed more packets. I don't understand. It's a couple of writes - what is the issue? Oh, handle_tx could process more packets per loop for the multiple-stream TCP_RR case. I need to print out the data rate per loop to confirm this. Shirley
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 09:05:56PM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote: I think I need to define the test matrix to collect data for TX xmit from guest to host here for different tests. Data to be collected: 1. kvm_stat for VM, I/O exits 2. cpu utilization for both guest and host 3. cat /proc/interrupts on guest 4. packet rate from vhost handle_tx per loop 5. guest netif queue stop rate 6. how many packets are waiting for free between vhost signaling and guest callback 7. performance results Tests: 1. TCP_STREAM single stream test for 1K to 4K message size 2. TCP_RR (64 instance test): 128 - 1K request/response size Different hacks: 1. Base line data (with the patch to fix the capacity check first; free_old_xmit_skbs returns number of skbs) 2. Drop packet data (will put some debugging in generic networking code) Since I found that the netif queue stop/wake up is so expensive, I created a packet-dropping patch on the guest side so I don't need to debug generic networking code:

guest start_xmit()
	capacity = free_old_xmit_skbs() + virtqueue_get_num_freed()
	if (capacity == 0) {
		drop this packet;
		return;
	}

In the patch, both guest TX interrupts and the callback have been omitted. Host vhost_signal in handle_tx can be removed entirely as well. (A new virtio_ring API is needed to export the total number of free descriptors here -- virtqueue_get_num_freed.) Initial TCP_STREAM performance results I got for guest to local host: 4.2Gb/s for 1K message size (vs. 2.5Gb/s), 6.2Gb/s for 2K message size (vs. 3.8Gb/s), and 9.8Gb/s for 4K message size (vs. 5.xGb/s). What is the average packet size, # bytes per ack, and the # of interrupts per packet? It could be that just slowing down transmission makes GSO work better. Since the large message size (64K) doesn't hit the (capacity == 0) case, its performance is only a little better (from 13.xGb/s to 14.xGb/s). kvm_stat output shows a significant exit reduction for both VM and I/O, and no guest TX interrupts.
With dropping packets, TCP retrans has increased here, so the performance numbers vary. This might not be a good solution, but it gave us some idea of how expensive the netif queue stop/wake up between guest and host notification is. I couldn't find a better solution for reducing the netif queue stop/wake up rate for small message sizes. But I think once we can address this, the guest TX performance will burst for small message sizes. I also compared this with the approach of returning TX_BUSY when (capacity == 0); it is not as good as dropping packets. 3. Delay guest netif queue wake up until certain descriptors (1/2 ring size, 1/4 ring size...) are available once the queue has stopped. 4. Accumulate more packets per vhost signal in handle_tx? 5. 3 & 4 combinations 6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer? 7. Accumulate more packets per vhost handle_tx() by adding some delay? Haven't noticed that part, how does your patch make it handle more packets? Added a delay in handle_tx(). What else? It would take some time to do this. Shirley Need to think about this.
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 10:09:14PM -0800, Shirley Ma wrote: On Thu, 2011-02-03 at 07:59 +0200, Michael S. Tsirkin wrote: Let's look at the sequence here:

guest start_xmit()
	xmit_skb()
	if ring is full, enable_cb()

guest skb_xmit_done()
	disable_cb,
	printk free_old_xmit_skbs -- it was between more than 1/2 to full ring size
	printk vq->num_free

vhost handle_tx()
	if (guest interrupt is enabled) signal guest to free xmit buffers

So between the guest queue getting full/stopping the queue/enabling the callback and the guest receiving the callback from the host to free_old_xmit_skbs, there were about 1/2 to the full ring size of descriptors available. I thought there were only a few. (I disabled your vhost patch for this test.) The expected number is vq->num - max skb frags - 2. It varied (up to the ring size, 256). This is using indirect buffers; it returned how many descriptors were freed, not the number of buffers. Why do you think it is vq->num - max skb frags - 2 here? Shirley Well, the queue is stopped, which happens when

if (capacity < 2+MAX_SKB_FRAGS) {
	netif_stop_queue(dev);
	if (unlikely(!virtqueue_enable_cb(vi->svq))) {
		/* More just got used, free them then recheck. */
		capacity += free_old_xmit_skbs(vi);
		if (capacity >= 2+MAX_SKB_FRAGS) {
			netif_start_queue(dev);
			virtqueue_disable_cb(vi->svq);
		}
	}
}

This should be the most common case. I guess the case with += free_old_xmit_skbs is what can get us more. But it should be rare. Can you count how common it is? -- MST
Re: Network performance with small packets
On Thu, Feb 03, 2011 at 07:58:00AM -0800, Shirley Ma wrote: On Thu, 2011-02-03 at 08:13 +0200, Michael S. Tsirkin wrote: Initial TCP_STREAM performance results I got for guest to local host: 4.2Gb/s for 1K message size (vs. 2.5Gb/s), 6.2Gb/s for 2K message size (vs. 3.8Gb/s), and 9.8Gb/s for 4K message size (vs. 5.xGb/s). What is the average packet size, # bytes per ack, and the # of interrupts per packet? It could be that just slowing down transmission makes GSO work better. There are no TX interrupts with dropping packets. GSO/TSO is the key for small message performance; w/o GSO/TSO, the performance is limited to about 2Gb/s no matter how big the message size is. I think any work we try here will increase the large packet rate. BTW, with dropping packets, TCP increased fast retrans, not slow start. I will collect tcpdump and netstat data before and after to compare packet size/rate w/o and w/i the patch. Thanks Shirley Just a thought: does it help to make the tx queue len of the virtio device smaller? E.g. match the vq size? -- MST
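One way to try MST's suggestion from inside the guest (a config sketch; `eth0` and the 256-entry ring size are assumptions for illustration, not from the thread):

```shell
# Shrink the qdisc transmit queue to roughly match the vq size
# (eth0 and 256 are illustrative assumptions).
ip link set dev eth0 txqueuelen 256
# Verify the new length:
ip link show dev eth0 | grep -o 'qlen [0-9]*'
```

With a shorter qdisc queue, overflow shows up as qdisc drops instead of deep queueing, which is closer to the drop-on-full behaviour being tested.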
Re: [PATCH 03/13] AMD IOMMU emulation
On Fri, Feb 04, 2011 at 01:32:57AM +0200, Eduard - Gabriel Munteanu wrote: This introduces emulation for the AMD IOMMU, described in AMD I/O Virtualization Technology (IOMMU) Specification. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro --- Makefile.target |2 +- hw/amd_iommu.c | 694 +++ hw/pc.c |2 + hw/pci_ids.h|2 + hw/pci_regs.h |1 + 5 files changed, 700 insertions(+), 1 deletions(-) create mode 100644 hw/amd_iommu.c diff --git a/Makefile.target b/Makefile.target index e5817ab..4b650bd 100644 --- a/Makefile.target +++ b/Makefile.target @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += debugcon.o multiboot.o -obj-i386-y += pc_piix.o dma_rw.o +obj-i386-y += pc_piix.o dma_rw.o amd_iommu.o obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o # shared objects diff --git a/hw/amd_iommu.c b/hw/amd_iommu.c new file mode 100644 index 000..6c6346a --- /dev/null +++ b/hw/amd_iommu.c @@ -0,0 +1,694 @@ +/* + * AMD IOMMU emulation + * + * Copyright (c) 2011 Eduard - Gabriel Munteanu + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include "pc.h" +#include "hw.h" +#include "pci.h" +#include "qlist.h" +#include "dma_rw.h" + +/* Capability registers */ +#define CAPAB_HEADER 0x00 +#define CAPAB_REV_TYPE 0x02 +#define CAPAB_FLAGS 0x03 +#define CAPAB_BAR_LOW 0x04 +#define CAPAB_BAR_HIGH 0x08 +#define CAPAB_RANGE 0x0C +#define CAPAB_MISC 0x10 + +#define CAPAB_SIZE 0x14 +#define CAPAB_REG_SIZE 0x04 + +/* Capability header data */ +#define CAPAB_FLAG_IOTLBSUP (1 << 0) +#define CAPAB_FLAG_HTTUNNEL (1 << 1) +#define CAPAB_FLAG_NPCACHE (1 << 2) +#define CAPAB_INIT_REV (1 << 3) +#define CAPAB_INIT_TYPE 3 +#define CAPAB_INIT_REV_TYPE (CAPAB_REV | CAPAB_TYPE) +#define CAPAB_INIT_FLAGS (CAPAB_FLAG_NPCACHE | CAPAB_FLAG_HTTUNNEL) +#define CAPAB_INIT_MISC ((64 << 15) | (48 << 8)) +#define CAPAB_BAR_MASK ~((1UL << 14) - 1) + +/* MMIO registers */ +#define MMIO_DEVICE_TABLE 0x0000 +#define MMIO_COMMAND_BASE 0x0008 +#define MMIO_EVENT_BASE 0x0010 +#define MMIO_CONTROL 0x0018 +#define MMIO_EXCL_BASE 0x0020 +#define MMIO_EXCL_LIMIT 0x0028 +#define MMIO_COMMAND_HEAD 0x2000 +#define MMIO_COMMAND_TAIL 0x2008 +#define MMIO_EVENT_HEAD 0x2010 +#define MMIO_EVENT_TAIL 0x2018 +#define MMIO_STATUS 0x2020 + +#define MMIO_SIZE 0x4000 + +#define MMIO_DEVTAB_SIZE_MASK ((1ULL << 12) - 1) +#define MMIO_DEVTAB_BASE_MASK (((1ULL << 52) - 1) & ~MMIO_DEVTAB_SIZE_MASK) +#define MMIO_DEVTAB_ENTRY_SIZE 32 +#define MMIO_DEVTAB_SIZE_UNIT 4096 + +#define MMIO_CMDBUF_SIZE_BYTE (MMIO_COMMAND_BASE + 7) +#define MMIO_CMDBUF_SIZE_MASK 0x0F +#define MMIO_CMDBUF_BASE_MASK MMIO_DEVTAB_BASE_MASK +#define MMIO_CMDBUF_DEFAULT_SIZE 8 +#define MMIO_CMDBUF_HEAD_MASK (((1ULL << 19) - 1) & ~0x0F) +#define MMIO_CMDBUF_TAIL_MASK MMIO_EVTLOG_HEAD_MASK + +#define MMIO_EVTLOG_SIZE_BYTE (MMIO_EVENT_BASE + 7) +#define MMIO_EVTLOG_SIZE_MASK MMIO_CMDBUF_SIZE_MASK +#define MMIO_EVTLOG_BASE_MASK MMIO_CMDBUF_BASE_MASK +#define MMIO_EVTLOG_DEFAULT_SIZE MMIO_CMDBUF_DEFAULT_SIZE +#define MMIO_EVTLOG_HEAD_MASK (((1ULL << 19) - 1) & ~0x0F) +#define MMIO_EVTLOG_TAIL_MASK MMIO_EVTLOG_HEAD_MASK + +#define MMIO_EXCL_BASE_MASK MMIO_DEVTAB_BASE_MASK +#define MMIO_EXCL_ENABLED_MASK (1ULL << 0) +#define MMIO_EXCL_ALLOW_MASK (1ULL << 1) +#define
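As a quick sanity check on the mask arithmetic in the quoted defines, the sketch below decodes a hypothetical device-table register value. Only the two DEVTAB masks and the size-unit/entry-size constants are taken from the patch; the helper functions are invented here, and the "(number of 4K units) - 1" size encoding follows the AMD IOMMU specification's device table base address register, so treat this as illustrative.

```c
/* Sketch of how a device-table register would be decoded with the
 * DEVTAB masks from the quoted patch: the size field lives in the
 * low 12 bits, the base address in bits 12..51.  The helpers are
 * invented for illustration, not part of the patch. */
#include <assert.h>
#include <stdint.h>

#define MMIO_DEVTAB_SIZE_MASK ((1ULL << 12) - 1)
#define MMIO_DEVTAB_BASE_MASK (((1ULL << 52) - 1) & ~MMIO_DEVTAB_SIZE_MASK)
#define MMIO_DEVTAB_SIZE_UNIT 4096

static uint64_t devtab_base(uint64_t reg)
{
    return reg & MMIO_DEVTAB_BASE_MASK;   /* 4K-aligned base, bits 12..51 */
}

/* Size field encodes (number of 4K units) - 1 per the AMD spec. */
static uint64_t devtab_len(uint64_t reg)
{
    return ((reg & MMIO_DEVTAB_SIZE_MASK) + 1) * MMIO_DEVTAB_SIZE_UNIT;
}
```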
Re: [PATCH 01/13] Generic DMA memory access interface
On Fri, Feb 04, 2011 at 01:32:55AM +0200, Eduard - Gabriel Munteanu wrote: This introduces replacements for memory access functions like cpu_physical_memory_read(). The new interface can handle address translation and access checking through an IOMMU. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro --- Makefile.target |2 +- hw/dma_rw.c | 124 +++ hw/dma_rw.h | 157 +++ 3 files changed, 282 insertions(+), 1 deletions(-) create mode 100644 hw/dma_rw.c create mode 100644 hw/dma_rw.h diff --git a/Makefile.target b/Makefile.target index e15b1c4..e5817ab 100644 --- a/Makefile.target +++ b/Makefile.target @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += debugcon.o multiboot.o -obj-i386-y += pc_piix.o +obj-i386-y += pc_piix.o dma_rw.o obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o # shared objects diff --git a/hw/dma_rw.c b/hw/dma_rw.c new file mode 100644 index 000..ef8e7f8 --- /dev/null +++ b/hw/dma_rw.c @@ -0,0 +1,124 @@ +/* + * Generic DMA memory access interface. + * + * Copyright (c) 2011 Eduard - Gabriel Munteanu + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include "dma_rw.h" +#include "range.h" + +static void dma_register_memory_map(DMADevice *dev, +dma_addr_t addr, +dma_addr_t len, +target_phys_addr_t paddr, +DMAInvalidateMapFunc *invalidate, +void *invalidate_opaque) +{ +DMAMemoryMap *map; + +map = qemu_malloc(sizeof(DMAMemoryMap)); +map->addr = addr; +map->len = len; +map->paddr = paddr; +map->invalidate = invalidate; +map->invalidate_opaque = invalidate_opaque; + +QLIST_INSERT_HEAD(&dev->mmu->memory_maps, map, list); +} + +static void dma_unregister_memory_map(DMADevice *dev, + target_phys_addr_t paddr, + dma_addr_t len) +{ +DMAMemoryMap *map; + +QLIST_FOREACH(map, &dev->mmu->memory_maps, list) { +if (map->paddr == paddr && map->len == len) { +QLIST_REMOVE(map, list); +free(map); +} +} +} + +void dma_invalidate_memory_range(DMADevice *dev, + dma_addr_t addr, + dma_addr_t len) +{ +DMAMemoryMap *map; + +QLIST_FOREACH(map, &dev->mmu->memory_maps, list) { +if (ranges_overlap(addr, len, map->addr, map->len)) { +map->invalidate(map->invalidate_opaque); +QLIST_REMOVE(map, list); +free(map); +} +} +} + +void *dma_memory_map(DMADevice *dev, + DMAInvalidateMapFunc *cb, + void *opaque, + dma_addr_t addr, + dma_addr_t *len, + int is_write) +{ +int err; +target_phys_addr_t paddr, plen; + +if (!dev || !dev->mmu) { +return cpu_physical_memory_map(addr, len, is_write); +} + +plen = *len; +err = dev->mmu->translate(dev, addr, &paddr, &plen, is_write); +if (err) { +return NULL; +} + +/* + * If this is true, the virtual region is contiguous, + * but the translated physical region isn't. We just + * clamp *len, much like cpu_physical_memory_map() does. + */ +if (plen < *len) { +*len = plen; +} + +/* We treat maps as remote TLBs to cope with stuff like AIO. */ +if (cb) {
Re: [PATCH 04/13] ide: use the DMA memory access interface for PCI IDE controllers
On Fri, Feb 04, 2011 at 01:32:58AM +0200, Eduard - Gabriel Munteanu wrote: Emulated PCI IDE controllers now use the memory access interface. This also allows an emulated IOMMU to translate and check accesses. Map invalidation results in cancelling DMA transfers. Since the guest OS can't properly recover the DMA results in case the mapping is changed, this is a fairly good approximation. Note this doesn't handle AHCI emulation yet! Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro How about not changing ahci then, and failing initialization if an mmu is present? --- dma-helpers.c | 23 ++- dma.h | 4 +++- hw/ide/ahci.c | 3 ++- hw/ide/internal.h | 1 + hw/ide/macio.c | 4 ++-- hw/ide/pci.c | 18 +++--- 6 files changed, 37 insertions(+), 16 deletions(-) diff --git a/dma-helpers.c b/dma-helpers.c index 712ed89..29a74a4 100644 --- a/dma-helpers.c +++ b/dma-helpers.c @@ -10,12 +10,13 @@ #include "dma.h" #include "block_int.h" -void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint) +void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint, DMADevice *dma) { qsg->sg = qemu_malloc(alloc_hint * sizeof(ScatterGatherEntry)); qsg->nsg = 0; qsg->nalloc = alloc_hint; qsg->size = 0; +qsg->dma = dma; } void qemu_sglist_add(QEMUSGList *qsg, target_phys_addr_t base, @@ -73,12 +73,23 @@ static void dma_bdrv_unmap(DMAAIOCB *dbs) int i; for (i = 0; i < dbs->iov.niov; ++i) { -cpu_physical_memory_unmap(dbs->iov.iov[i].iov_base, - dbs->iov.iov[i].iov_len, !dbs->is_write, - dbs->iov.iov[i].iov_len); +dma_memory_unmap(dbs->sg->dma, + dbs->iov.iov[i].iov_base, + dbs->iov.iov[i].iov_len, !dbs->is_write, + dbs->iov.iov[i].iov_len); } } +static void dma_bdrv_cancel(void *opaque) +{ +DMAAIOCB *dbs = opaque; + +bdrv_aio_cancel(dbs->acb); +dma_bdrv_unmap(dbs); +qemu_iovec_destroy(&dbs->iov); +qemu_aio_release(dbs); +} + static void dma_bdrv_cb(void *opaque, int ret) { DMAAIOCB *dbs = (DMAAIOCB *)opaque; @@ -100,7 +112,8 @@ static void dma_bdrv_cb(void *opaque, int ret) while (dbs->sg_cur_index < dbs->sg->nsg) { cur_addr = dbs->sg->sg[dbs->sg_cur_index].base + dbs->sg_cur_byte; cur_len = dbs->sg->sg[dbs->sg_cur_index].len - dbs->sg_cur_byte; -mem = cpu_physical_memory_map(cur_addr, &cur_len, !dbs->is_write); +mem = dma_memory_map(dbs->sg->dma, dma_bdrv_cancel, dbs, + cur_addr, &cur_len, !dbs->is_write); if (!mem) break; qemu_iovec_add(&dbs->iov, mem, cur_len); diff --git a/dma.h b/dma.h index f3bb275..2417b32 100644 --- a/dma.h +++ b/dma.h @@ -14,6 +14,7 @@ //#include "cpu.h" #include "hw/hw.h" #include "block.h" +#include "hw/dma_rw.h" typedef struct { target_phys_addr_t base; @@ -25,9 +26,10 @@ typedef struct { int nsg; int nalloc; target_phys_addr_t size; +DMADevice *dma; } QEMUSGList; -void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint); +void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint, DMADevice *dma); void qemu_sglist_add(QEMUSGList *qsg, target_phys_addr_t base, target_phys_addr_t len); void qemu_sglist_destroy(QEMUSGList *qsg); diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c index 968fdce..aea06a9 100644 --- a/hw/ide/ahci.c +++ b/hw/ide/ahci.c @@ -993,7 +993,8 @@ static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist) if (sglist_alloc_hint > 0) { AHCI_SG *tbl = (AHCI_SG *)prdt; -qemu_sglist_init(sglist, sglist_alloc_hint); +/* FIXME: pass a proper DMADevice. */ +qemu_sglist_init(sglist, sglist_alloc_hint, NULL); for (i = 0; i < sglist_alloc_hint; i++) { /* flags_size is zero-based */ qemu_sglist_add(sglist, le64_to_cpu(tbl[i].addr), diff --git a/hw/ide/internal.h b/hw/ide/internal.h index 697c3b4..3d3d5db 100644 --- a/hw/ide/internal.h +++ b/hw/ide/internal.h @@ -468,6 +468,7 @@ struct IDEDMA { struct iovec iov; QEMUIOVector qiov; BlockDriverAIOCB *aiocb; +DMADevice *dev; }; struct IDEBus { diff --git a/hw/ide/macio.c b/hw/ide/macio.c index c1b4caa..654ae7c 100644 --- a/hw/ide/macio.c +++ b/hw/ide/macio.c @@ -79,7 +79,7 @@ static void pmac_ide_atapi_transfer_cb(void *opaque, int ret) s->io_buffer_size = io->len; -qemu_sglist_init(&s->sg, io->len / MACIO_PAGE_SIZE + 1); +qemu_sglist_init(&s->sg, io->len / MACIO_PAGE_SIZE + 1, NULL); qemu_sglist_add(&s->sg, io->addr, io->len); io->addr += io->len; io->len = 0; @@ -141,7 +141,7 @@ static void pmac_ide_transfer_cb(void *opaque, int ret) s->io_buffer_index = 0; s->io_buffer_size =
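The partial-mapping behaviour that the quoted dma_bdrv_cb() loop relies on — resuming mid-entry when dma_memory_map() clamps the mapped length — can be modelled in isolation. This sketch is not the QEMU code: the walk struct and walk_advance() are invented, and error paths are omitted; it only mirrors the sg_cur_index/sg_cur_byte bookkeeping in the quoted loop, assuming each mapping covers at most the remainder of the current entry.

```c
/* Model of the scatter-gather walk in the quoted dma_bdrv_cb():
 * sg_cur_index/sg_cur_byte track progress through the list, and a
 * mapping may cover only part of the current entry (e.g. when the
 * IOMMU clamps the length), in which case the walk resumes
 * mid-entry.  Simplified and self-contained; names are invented. */
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;
struct sg { dma_addr_t base, len; };

struct walk {
    const struct sg *list;
    int nsg;
    int cur_index;
    dma_addr_t cur_byte;
};

/* Advance by 'mapped' bytes; returns the guest address the next
 * mapping should start at, or 0 when the list is exhausted.
 * Assumes 'mapped' never exceeds the remainder of the entry. */
static dma_addr_t walk_advance(struct walk *w, dma_addr_t mapped)
{
    w->cur_byte += mapped;
    if (w->cur_byte == w->list[w->cur_index].len) {
        w->cur_byte = 0;
        w->cur_index++;
    }
    if (w->cur_index == w->nsg)
        return 0;
    return w->list[w->cur_index].base + w->cur_byte;
}
```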
Re: [PATCH 01/13] Generic DMA memory access interface
On Fri, Feb 04, 2011 at 01:32:55AM +0200, Eduard - Gabriel Munteanu wrote: This introduces replacements for memory access functions like cpu_physical_memory_read(). The new interface can handle address translation and access checking through an IOMMU. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro --- Makefile.target |2 +- hw/dma_rw.c | 124 +++ hw/dma_rw.h | 157 +++ 3 files changed, 282 insertions(+), 1 deletions(-) create mode 100644 hw/dma_rw.c create mode 100644 hw/dma_rw.h diff --git a/Makefile.target b/Makefile.target index e15b1c4..e5817ab 100644 --- a/Makefile.target +++ b/Makefile.target @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += debugcon.o multiboot.o -obj-i386-y += pc_piix.o +obj-i386-y += pc_piix.o dma_rw.o Does this need to be target specific? obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o # shared objects diff --git a/hw/dma_rw.c b/hw/dma_rw.c new file mode 100644 index 000..ef8e7f8 --- /dev/null +++ b/hw/dma_rw.c @@ -0,0 +1,124 @@ +/* + * Generic DMA memory access interface. + * + * Copyright (c) 2011 Eduard - Gabriel Munteanu + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include "dma_rw.h" +#include "range.h" + +static void dma_register_memory_map(DMADevice *dev, +dma_addr_t addr, +dma_addr_t len, +target_phys_addr_t paddr, +DMAInvalidateMapFunc *invalidate, +void *invalidate_opaque) +{ +DMAMemoryMap *map; + +map = qemu_malloc(sizeof(DMAMemoryMap)); +map->addr = addr; +map->len = len; +map->paddr = paddr; +map->invalidate = invalidate; +map->invalidate_opaque = invalidate_opaque; + +QLIST_INSERT_HEAD(&dev->mmu->memory_maps, map, list); +} + +static void dma_unregister_memory_map(DMADevice *dev, + target_phys_addr_t paddr, + dma_addr_t len) +{ +DMAMemoryMap *map; + +QLIST_FOREACH(map, &dev->mmu->memory_maps, list) { +if (map->paddr == paddr && map->len == len) { +QLIST_REMOVE(map, list); +free(map); +} +} +} + +void dma_invalidate_memory_range(DMADevice *dev, + dma_addr_t addr, + dma_addr_t len) +{ +DMAMemoryMap *map; + +QLIST_FOREACH(map, &dev->mmu->memory_maps, list) { +if (ranges_overlap(addr, len, map->addr, map->len)) { +map->invalidate(map->invalidate_opaque); +QLIST_REMOVE(map, list); +free(map); +} +} +} + +void *dma_memory_map(DMADevice *dev, + DMAInvalidateMapFunc *cb, + void *opaque, + dma_addr_t addr, + dma_addr_t *len, + int is_write) +{ +int err; +target_phys_addr_t paddr, plen; + +if (!dev || !dev->mmu) { +return cpu_physical_memory_map(addr, len, is_write); +} + +plen = *len; +err = dev->mmu->translate(dev, addr, &paddr, &plen, is_write); +if (err) { +return NULL; +} + +/* + * If this is true, the virtual region is contiguous, + * but the translated physical region isn't. We just + * clamp *len, much like cpu_physical_memory_map() does. + */ +if (plen < *len) { +*len = plen; +} + +/* We treat maps as remote TLBs to cope
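The invalidation bookkeeping in the quoted dma_rw.c can be modelled with a fixed array instead of a QLIST. Everything except the overlap predicate — which matches the intent of QEMU's ranges_overlap() — is invented for this sketch: a registered map whose range overlaps an invalidated range has its callback "fired" (a flag here) and is dropped, so a second invalidation of the same range finds nothing.

```c
/* Self-contained model of the map-invalidation bookkeeping in the
 * quoted dma_rw.c: registered maps are kept in a collection, and
 * invalidate_range() fires the callback for every map overlapping
 * the invalidated range and removes it.  Array instead of QLIST;
 * all names here are invented for the sketch. */
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

struct map { dma_addr_t addr, len; int live; int invalidated; };

/* Half-open overlap test for nonzero lengths. */
static int ranges_overlap(dma_addr_t a, dma_addr_t alen,
                          dma_addr_t b, dma_addr_t blen)
{
    return a < b + blen && b < a + alen;
}

static int invalidate_range(struct map *maps, int n,
                            dma_addr_t addr, dma_addr_t len)
{
    int hit = 0;
    for (int i = 0; i < n; i++) {
        if (maps[i].live &&
            ranges_overlap(addr, len, maps[i].addr, maps[i].len)) {
            maps[i].invalidated = 1;   /* stands in for map->invalidate() */
            maps[i].live = 0;          /* QLIST_REMOVE + free */
            hit++;
        }
    }
    return hit;
}
```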
Re: [PATCH 2/3] AMD IOMMU support
On Fri, Feb 04, 2011 at 01:24:14AM +0200, Eduard - Gabriel Munteanu wrote: This initializes the AMD IOMMU and creates ACPI tables for it. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro --- src/acpi.c | 84 src/config.h |3 ++ src/pci_ids.h |1 + src/pci_regs.h |1 + src/pciinit.c | 29 +++ 5 files changed, 118 insertions(+), 0 deletions(-) diff --git a/src/acpi.c b/src/acpi.c index 18830dc..fca152c 100644 --- a/src/acpi.c +++ b/src/acpi.c @@ -196,6 +196,36 @@ struct srat_memory_affinity u32reserved3[2]; } PACKED; +/* + * IVRS (I/O Virtualization Reporting Structure) table. + * + * Describes the AMD IOMMU, as per: + * AMD I/O Virtualization Technology (IOMMU) Specification, rev 1.26 + */ + +struct ivrs_ivhd +{ +u8type; +u8flags; +u16 length; +u16 devid; +u16 capab_off; +u32 iommu_base_low; +u32 iommu_base_high; +u16 pci_seg_group; +u16 iommu_info; +u32 reserved; +u8entry[0]; +} PACKED; + +struct ivrs_table +{ +ACPI_TABLE_HEADER_DEF/* ACPI common table header. */ +u32iv_info; +u32reserved[2]; +struct ivrs_ivhd ivhd; +} PACKED; + prefix with amd_iommu_ or amd_ then ? #include acpi-dsdt.hex static void @@ -579,6 +609,59 @@ build_srat(void) return srat; } +#define IVRS_SIGNATURE 0x53525649 // IVRS +#define IVRS_MAX_DEVS 32 +static void * +build_ivrs(void) +{ +int iommu_bdf, iommu_cap; +int bdf, max, i; +struct ivrs_table *ivrs; +struct ivrs_ivhd *ivhd; + +/* Note this currently works for a single IOMMU! */ Meant to be a FIXME? How hard is it to fix? Just stick this in a loop? 
+iommu_bdf = pci_find_class(PCI_CLASS_SYSTEM_IOMMU); +if (iommu_bdf < 0) +return NULL; +iommu_cap = pci_find_capability(iommu_bdf, PCI_CAP_ID_SEC); +if (iommu_cap < 0) +return NULL; + +ivrs = malloc_high(sizeof(struct ivrs_table) + 4 * IVRS_MAX_DEVS); +ivrs->iv_info = pci_config_readw(iommu_bdf, iommu_cap + 0x12) & ~0x000F; + +ivhd = &ivrs->ivhd; +ivhd->type = 0x10; +ivhd->flags = 0; +ivhd->length = sizeof(struct ivrs_ivhd); +ivhd->devid = iommu_bdf; +ivhd->capab_off = iommu_cap; +ivhd->iommu_base_low = pci_config_readl(iommu_bdf, iommu_cap + 0x04) + 0xFFFE; +ivhd->iommu_base_high = pci_config_readl(iommu_bdf, iommu_cap + 0x08); +ivhd->pci_seg_group = 0; +ivhd->iommu_info = 0; +ivhd->reserved = 0; + +i = 0; +foreachpci(bdf, max) { +if (bdf == ivhd->devid) +continue; +ivhd->entry[4 * i + 0] = 2; +ivhd->entry[4 * i + 1] = bdf & 0xFF; +ivhd->entry[4 * i + 2] = (bdf >> 8) & 0xFF; +ivhd->entry[4 * i + 3] = ~(1 << 3); +ivhd->length += 4; +if (++i >= IVRS_MAX_DEVS) +break; +} + +build_header((void *) ivrs, IVRS_SIGNATURE, + sizeof(struct ivrs_table) + 4 * i, 1); + +return ivrs; +} + static const struct pci_device_id acpi_find_tbl[] = { /* PIIX4 Power Management device.
*/ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82371AB_3, NULL), @@ -625,6 +708,7 @@ acpi_bios_init(void) ACPI_INIT_TABLE(build_madt()); ACPI_INIT_TABLE(build_hpet()); ACPI_INIT_TABLE(build_srat()); +ACPI_INIT_TABLE(build_ivrs()); u16 i, external_tables = qemu_cfg_acpi_additional_tables(); diff --git a/src/config.h b/src/config.h index 6356941..0ba5723 100644 --- a/src/config.h +++ b/src/config.h @@ -172,6 +172,9 @@ #define BUILD_APIC_ADDR 0xfee0 #define BUILD_IOAPIC_ADDR 0xfec0 +#define BUILD_AMD_IOMMU_START 0xfed0 +#define BUILD_AMD_IOMMU_END 0xfee0/* BUILD_APIC_ADDR */ + #define BUILD_SMM_INIT_ADDR 0x38000 #define BUILD_SMM_ADDR0xa8000 #define BUILD_SMM_SIZE0x8000 diff --git a/src/pci_ids.h b/src/pci_ids.h index e1cded2..3cc3c6e 100644 --- a/src/pci_ids.h +++ b/src/pci_ids.h @@ -72,6 +72,7 @@ #define PCI_CLASS_SYSTEM_RTC 0x0803 #define PCI_CLASS_SYSTEM_PCI_HOTPLUG 0x0804 #define PCI_CLASS_SYSTEM_SDHCI 0x0805 +#define PCI_CLASS_SYSTEM_IOMMU 0x0806 #define PCI_CLASS_SYSTEM_OTHER 0x0880 #define PCI_BASE_CLASS_INPUT 0x09 diff --git a/src/pci_regs.h b/src/pci_regs.h index e5effd4..bfac824 100644 --- a/src/pci_regs.h +++ b/src/pci_regs.h @@ -208,6 +208,7 @@ #define PCI_CAP_ID_SHPC 0x0C/* PCI Standard Hot-Plug Controller */ #define PCI_CAP_ID_SSVID0x0D/* Bridge subsystem vendor/device ID */ #define PCI_CAP_ID_AGP3 0x0E/* AGP Target
Re: [PATCH 2/3] AMD IOMMU support
On Sun, Feb 06, 2011 at 03:41:45PM +0200, Eduard - Gabriel Munteanu wrote: On Sun, Feb 06, 2011 at 01:47:57PM +0200, Michael S. Tsirkin wrote: On Fri, Feb 04, 2011 at 01:24:14AM +0200, Eduard - Gabriel Munteanu wrote: Hi, [snip] +/* + * IVRS (I/O Virtualization Reporting Structure) table. + * + * Describes the AMD IOMMU, as per: + * AMD I/O Virtualization Technology (IOMMU) Specification, rev 1.26 + */ + +struct ivrs_ivhd +{ +u8type; +u8flags; +u16 length; +u16 devid; +u16 capab_off; +u32 iommu_base_low; +u32 iommu_base_high; +u16 pci_seg_group; +u16 iommu_info; +u32 reserved; +u8entry[0]; +} PACKED; + +struct ivrs_table +{ +ACPI_TABLE_HEADER_DEF/* ACPI common table header. */ +u32iv_info; +u32reserved[2]; +struct ivrs_ivhd ivhd; +} PACKED; + prefix with amd_iommu_ or amd_ then ? This should be standard nomenclature already, even if IVRS is AMD IOMMU-specific. Yes but the specific structure is amd specific, isn't it? #include acpi-dsdt.hex static void @@ -579,6 +609,59 @@ build_srat(void) return srat; } +#define IVRS_SIGNATURE 0x53525649 // IVRS +#define IVRS_MAX_DEVS 32 +static void * +build_ivrs(void) +{ +int iommu_bdf, iommu_cap; +int bdf, max, i; +struct ivrs_table *ivrs; +struct ivrs_ivhd *ivhd; + +/* Note this currently works for a single IOMMU! */ Meant to be a FIXME? How hard is it to fix? Just stick this in a loop? I suspect a real BIOS would have these values hardcoded anyway, according to the topology of the PCI bus and which IOMMUs sit where. Which values exactly? You already mentioned the possibility of multiple IOMMU capabilities in the same function/bus, in which case there's probably no easy way to guess it from SeaBIOS. It's easy enough to enumerate capabilities and pci devices, isn't it? [snip] +static void amd_iommu_init(u16 bdf, void *arg) +{ +int cap; +u32 base_addr; + +cap = pci_find_capability(bdf, PCI_CAP_ID_SEC); +if (cap 0) { +return; +} There actually can be multiple instances of this capability according to spec. Do we care? 
Hm, perhaps we should at least assign a base address there, that's easy. As for QEMU/KVM usage we probably don't need it. I expect assigning multiple domains will be useful. I'm guessing multiple devices is what systems have in this case? If so I'm not really sure why there is a need for multiple iommu capabilities per device. + +if (amd_iommu_addr >= BUILD_AMD_IOMMU_END) { +return; +} +base_addr = amd_iommu_addr; +amd_iommu_addr += 0x4000; + +pci_config_writel(bdf, cap + 0x0C, 0); +pci_config_writel(bdf, cap + 0x08, 0); +pci_config_writel(bdf, cap + 0x04, base_addr | 1); +} + static const struct pci_device_id pci_class_tbl[] = { /* STORAGE IDE */ PCI_DEVICE_CLASS(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82371SB_1, @@ -279,6 +302,10 @@ static const struct pci_device_id pci_class_tbl[] = { PCI_DEVICE_CLASS(PCI_ANY_ID, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, pci_bios_init_device_bridge), +/* AMD IOMMU */ Makes sense to limit to AMD vendor id? I don't think so; I assume any PCI_CLASS_SYSTEM_IOMMU device would implement the same specification, considering these ids have been assigned by PCI-SIG. This hasn't been the case in the past, e.g. with PCI_CLASS_NETWORK_ETHERNET, so I see no reason to assume it here. +PCI_DEVICE_CLASS(PCI_ANY_ID, PCI_ANY_ID, PCI_CLASS_SYSTEM_IOMMU, + amd_iommu_init), + /* default */ PCI_DEVICE(PCI_ANY_ID, PCI_ANY_ID, pci_bios_allocate_regions), @@ -408,6 +435,8 @@ pci_setup(void) pci_region_init(&pci_bios_prefmem_region, BUILD_PCIPREFMEM_START, BUILD_PCIPREFMEM_END - 1); +amd_iommu_addr = BUILD_AMD_IOMMU_START; + pci_bios_init_bus(); int bdf, max; -- 1.7.3.4 Thanks for your review, I read your other comments and will resubmit once I fix those issues. Eduard
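The per-device loop in the quoted build_ivrs() emits 4-byte IVHD device entries. The helper below is a sketch of just that encoding; the function name is invented, and the byte values follow the quoted loop (type 2 "select" entry, BDF split across two bytes, data byte with bit 3 clear), so treat it as illustrative rather than a complete IVRS encoder.

```c
/* Sketch of the 4-byte IVHD "select" device entry the quoted
 * build_ivrs() emits for each PCI device.  Layout follows the
 * quoted loop; the helper itself is invented for illustration. */
#include <assert.h>
#include <stdint.h>

static void ivhd_select_entry(uint8_t entry[4], uint16_t bdf)
{
    entry[0] = 2;                    /* "select" entry type */
    entry[1] = bdf & 0xFF;           /* BDF low byte */
    entry[2] = (bdf >> 8) & 0xFF;    /* BDF high byte */
    entry[3] = (uint8_t)~(1 << 3);   /* data byte with bit 3 clear, as quoted */
}
```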
Re: Network performance with small packets
On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote: On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Not sure whether the intent is to merge this. If yes - would it make sense to use tracing for this instead? That's what kvm does. 
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -7,6 +7,14 @@ config VIRTIO_RING tristate depends on VIRTIO +config VIRTIO_STATS + bool "Virtio debugging stats (EXPERIMENTAL)" + depends on VIRTIO_RING + select DEBUG_FS + ---help--- + Virtio stats collected by how full the ring is at any time, + presented under debugfs/virtio/name-vq/num-used/ + config VIRTIO_PCI tristate "PCI driver for virtio devices (EXPERIMENTAL)" depends on PCI && EXPERIMENTAL diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -21,6 +21,7 @@ #include <linux/virtio_config.h> #include <linux/device.h> #include <linux/slab.h> +#include <linux/debugfs.h> /* virtio guest is communicating with a virtual device that actually runs on * a host processor. Memory barriers are used to control SMP effects. */ @@ -95,6 +96,11 @@ struct vring_virtqueue /* How to notify other side. FIXME: commonalize hcalls! */ void (*notify)(struct virtqueue *vq); +#ifdef CONFIG_VIRTIO_STATS + struct vring_stat *stats; + struct dentry *statdir; +#endif + #ifdef DEBUG /* They're supposed to lock for us. */ unsigned int in_use; @@ -106,6 +112,87 @@ struct vring_virtqueue #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq) +#ifdef CONFIG_VIRTIO_STATS +/* We have an array of these, indexed by how full the ring is. */ +struct vring_stat { + /* How many interrupts? */ + size_t interrupt_nowork, interrupt_work; + /* How many non-notify kicks, how many notify kicks, how many add notify? */ + size_t kick_no_notify, kick_notify, add_notify; + /* How many adds? */ + size_t add_direct, add_indirect, add_fail; + /* How many gets? */ + size_t get; + /* How many disable callbacks? */ + size_t disable_cb; + /* How many enables? */ + size_t enable_cb_retry, enable_cb_success; +}; + +static struct dentry *virtio_stats; + +static void create_stat_files(struct vring_virtqueue *vq) +{ + char name[80]; + unsigned int i; + + /* Racy in theory, but we don't care. */ + if (!virtio_stats) + virtio_stats = debugfs_create_dir("virtio-stats", NULL); + + sprintf(name, "%s-%s", dev_name(&vq->vq.vdev->dev), vq->vq.name); + vq->statdir = debugfs_create_dir(name, virtio_stats); + + for (i = 0; i < vq->vring.num; i++) { + struct dentry *dir; + + sprintf(name, "%i", i); + dir = debugfs_create_dir(name, vq->statdir); + debugfs_create_size_t("interrupt_nowork", 0400, dir, + &vq->stats[i].interrupt_nowork); + debugfs_create_size_t("interrupt_work", 0400, dir, + &vq->stats[i].interrupt_work); + debugfs_create_size_t("kick_no_notify", 0400, dir, + &vq->stats[i].kick_no_notify); + debugfs_create_size_t("kick_notify", 0400, dir, + &vq->stats[i].kick_notify); + debugfs_create_size_t("add_notify", 0400, dir, + &vq->stats[i
Re: Network performance with small packets
On Wed, Feb 09, 2011 at 12:09:35PM +1030, Rusty Russell wrote: On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote: On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote: On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Not sure whether the intent is to merge this. If yes - would it make sense to use tracing for this instead? That's what kvm does. Intent wasn't; I've not used tracepoints before, but maybe we should consider a longer-term monitoring solution? Patch welcome! Cheers, Rusty. Sure, I'll look into this. 
-- MST
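The core idea of Rusty's stats patch — an array of counter structs indexed by ring occupancy — can be shown in a few lines. This sketch is not the patch: the ring struct and helpers are invented, and only three of the vring_stat counters are kept; it just illustrates the index-by-fullness bookkeeping that makes the resulting histogram useful for diagnosing the kind of stall discussed in this thread.

```c
/* Minimal version of the idea in the quoted patch: keep one counter
 * struct per possible ring occupancy, and bump the slot matching how
 * full the ring is when the event happens.  Counter names are a
 * subset of the patch's vring_stat fields; the ring bookkeeping is
 * invented for the sketch. */
#include <assert.h>
#include <stddef.h>

struct vring_stat {
    size_t add_direct, add_indirect, get;
};

struct ring {
    unsigned num;             /* ring size */
    unsigned in_use;          /* descriptors currently used */
    struct vring_stat *stats; /* array of num + 1 entries */
};

static void stat_add(struct ring *r, int indirect)
{
    struct vring_stat *s = &r->stats[r->in_use];  /* index by fullness */
    if (indirect)
        s->add_indirect++;
    else
        s->add_direct++;
    r->in_use++;
}

static void stat_get(struct ring *r)
{
    r->stats[r->in_use].get++;
    r->in_use--;
}
```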
Re: [RFC PATCH V2 0/5] macvtap TX zero copy between guest and host kernel
On Fri, Dec 10, 2010 at 01:51:31AM -0800, Shirley Ma wrote: This patchset adds support for TX zero-copy between guest and host kernel through vhost. It significantly reduces CPU utilization on the local host on which the guest is located (it reduced vhost thread CPU usage by 30-50% for a single stream test). The patchset is based on the previous submission and comments from the community regarding when/how to handle guest kernel buffers to be released. This is the simplest approach I can think of after comparing with several other solutions. This patchset includes: 1. Introduce a new sock zero-copy flag, SOCK_ZEROCOPY; 2. Introduce a new device flag, NETIF_F_ZEROCOPY, for devices that can support zero-copy; 3. Add a new struct skb_ubuf_info in skb_share_info for a userspace buffer release callback, invoked when device DMA is done for that skb; 4. Add a vhost zero-copy callback in vhost for when the skb's last refcnt is gone; add vhost_zerocopy_add_used_and_signal to notify the guest to release TX skb buffers. 5. Add macvtap zero-copy in the lower device when the packet being sent is greater than 128 bytes. The patchset has passed netperf/netserver tests on Chelsio, and testing continues on other 10GbE NICs, like Intel ixgbe, Mellanox mlx4... I will provide guest-to-host and host-to-guest performance data next week. However, when running a stress test, vhost virtio_net seems to go out of sync: the virtio_net interrupt was disabled somehow, and it stopped sending any packets. This problem has bothered me for a long, long time; I will continue to look at it. Please review this. Thanks Shirley What's the status here? Since there are core net changes, we'll need to see the final version soon if it's to appear in 2.6.39. Could the problem be related to the patch virtio_net: Add schedule check to napi_enable call? Also, I expect there should be driver patches for some devices? Where are they?
Thanks, -- MST
Re: virtio_net sometimes didn't work
On Wed, Feb 16, 2011 at 09:53:25AM +0800, lidong chen wrote: Because of some other work, I could not focus on this problem last month. Now I find that a wrong value of isr causes this problem. In function vp_interrupt, the isr is 0, and the virtio_balloon pci device got 10 interrupts; then IRQ #11 was disabled.

static irqreturn_t vp_interrupt(int irq, void *opaque)
{
	struct virtio_pci_device *vp_dev = opaque;
	struct virtio_pci_vq_info *info;
	irqreturn_t ret = IRQ_NONE;
	unsigned long flags;
	u8 isr;

	/* reading the ISR has the effect of also clearing it so it's very
	 * important to save off the value. */
	isr = ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR);

	/* It's definitely not us if the ISR was not high */
	if (!isr)
		return IRQ_NONE;	// return from here

This implies that io addr values got swapped between the devices. Try lspci -vv in the guest and info pci in qemu and compare the io address values.

2010/12/9 Michael S. Tsirkin m...@redhat.com: On Fri, Nov 26, 2010 at 10:38:33AM +0800, lidong chen wrote: Does this message appear on boot, or after some stress? On boot, and it only appears when booting from network. Which qemu-kvm version? [root@kvm-4slot ~]# /usr/libexec/qemu-kvm --version QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2), Copyright (c) 2003-2008 Fabrice Bellard What happens with the latest qemu? Does cherry-picking 3fff0179e33cd7d0a688dab65700c46ad089e934 help? virtio_pci already includes this patch; we still have this problem. What does info irqs show in qemu? How do I collect this information? But I found that if I modify the slot number of the balloon device from 0x03 to 0x09, the problem is solved:

<memballoon model='virtio'>
  <alias name='balloon0'/>
  <address type='pci' domain='0x' bus='0x00' slot='0x09' function='0x0'/>
</memballoon>

Interesting. Is it possible that even after the balloon is moved, there's still the message in the guest, only this time things keep going afterwards? And I found someone else who also met this problem.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/584675
2010/11/25 Michael S. Tsirkin m...@redhat.com: On Thu, Nov 25, 2010 at 10:21:24PM +0800, lidong chen wrote: [version] the host os version is 2.6.32 Which qemu-kvm version? the guest os version is 2.6.16
[dmesg]
ACPI: (supports S3 S4 S5)
Freeing unused kernel memory: 200k freed
input: ImExPS/2 Generic Explorer Mouse as /class/input/input2
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
ACPI: PCI Interrupt :00:03.0[A] -> Link [LNKC] -> GSI 11 (level, high) -> IRQ 11 io address 0001c040
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 10
ACPI: PCI Interrupt :00:04.0[A] -> Link [LNKD] -> GSI 10 (level, high) -> IRQ 10 io address 0001c060
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
ACPI: PCI Interrupt :00:05.0[A] -> Link [LNKA] -> GSI 10 (level, high) -> IRQ 10 io address 0001c080
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 11
ACPI: PCI Interrupt :00:06.0[A] -> Link [LNKB] -> GSI 11 (level, high) -> IRQ 11 io address 0001c0a06
ACPI: PCI Interrupt :00:07.0[A] -> Link [LNKC] -> GSI 11 (level, high) -> IRQ 11 io address 0001c0c06
ACPI: PCI Interrupt :00:08.0[A] -> Link [LNKD] -> GSI 10 (level, high) -> IRQ 10 io address 0001c0e0
irq 11: nobody cared (try booting with the "irqpoll" option)
 [c01457b0] __report_bad_irq+0x2b/0x69
 [c0145979] note_interrupt+0x18b/0x1b2
 [c01452a9] handle_IRQ_event+0x26/0x51
 [c014537f] __do_IRQ+0xab/0xdc
 [c0106445] do_IRQ+0x46/0x53
 [c0104e8a] common_interrupt+0x1a/0x20
 [c01276f2] __do_softirq+0x4f/0xc2
 [c0127793] do_softirq+0x2e/0x32
 [c0104f3c] apic_timer_interrupt+0x1c/0x30
 [c0102d55] default_idle+0x2e/0x5c
 [c0102e14] cpu_idle+0x91/0xad
 [c03946e5] start_kernel+0x34c/0x353
handlers:
 [f88252ee] (vp_interrupt+0x0/0x3e [virtio_pci])
Disabling IRQ #11
Does this message appear on boot, or after some stress? Does cherry-picking 3fff0179e33cd7d0a688dab65700c46ad089e934 help? Happens with a newer kernel as guest? What does info irqs show in qemu?
-- MST
Re: virtio_net sometimes didn't work
- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Region 0: Memory at f000 (32-bit, prefetchable) [size=32M] Region 1: Memory at f200 (32-bit, non-prefetchable) [size=4K] Expansion ROM at f201 [disabled] [size=64K] 00:03.0 RAM memory: Unknown device 1af4:1002 Subsystem: Unknown device 1af4:0005 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Interrupt: pin A routed to IRQ 11 Region 0: I/O ports at c040 [size=32] 00:04.0 Ethernet controller: Unknown device 1af4:1000 Subsystem: Unknown device 1af4:0001 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Latency: 0 Interrupt: pin A routed to IRQ 10 Region 0: I/O ports at c060 [size=32] Region 1: Memory at f202 (32-bit, non-prefetchable) [size=4K] Expansion ROM at f203 [disabled] [size=64K] Capabilities: [40] MSI-X: Enable- Mask- TabSize=3 Vector table: BAR=1 offset= PBA: BAR=1 offset=0800 00:05.0 Ethernet controller: Unknown device 1af4:1000 Subsystem: Unknown device 1af4:0001 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Interrupt: pin A routed to IRQ 10 Region 0: I/O ports at c080 [size=32] Region 1: Memory at f204 (32-bit, non-prefetchable) [size=4K] Expansion ROM at f205 [disabled] [size=64K] Capabilities: [40] MSI-X: Enable- Mask- TabSize=3 Vector table: BAR=1 offset= PBA: BAR=1 offset=0800 00:06.0 Ethernet controller: Unknown device 1af4:1000 Subsystem: Unknown device 1af4:0001 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Interrupt: pin A routed to IRQ 11 
Region 0: I/O ports at c0a0 [size=32] Region 1: Memory at f206 (32-bit, non-prefetchable) [size=4K] Expansion ROM at f207 [disabled] [size=64K] Capabilities: [40] MSI-X: Enable- Mask- TabSize=3 Vector table: BAR=1 offset= PBA: BAR=1 offset=0800 00:07.0 Ethernet controller: Unknown device 1af4:1000 Subsystem: Unknown device 1af4:0001 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Interrupt: pin A routed to IRQ 11 Region 0: I/O ports at c0c0 [size=32] Region 1: Memory at f208 (32-bit, non-prefetchable) [size=4K] Expansion ROM at f209 [disabled] [size=64K] Capabilities: [40] MSI-X: Enable- Mask- TabSize=3 Vector table: BAR=1 offset= PBA: BAR=1 offset=0800 00:08.0 Ethernet controller: Unknown device 1af4:1000 Subsystem: Unknown device 1af4:0001 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Interrupt: pin A routed to IRQ 10 Region 0: I/O ports at c0e0 [size=32] Region 1: Memory at f20a (32-bit, non-prefetchable) [size=4K] Expansion ROM at f20b [disabled] [size=64K] Capabilities: [40] MSI-X: Enable- Mask- TabSize=3 Vector table: BAR=1 offset= PBA: BAR=1 offset=0800 2011/2/16 Michael S. Tsirkin m...@redhat.com: On Wed, Feb 16, 2011 at 09:53:25AM +0800, lidong chen wrote: because of some other work, i could not focus on this problem last month. now i find the wrong vaule of isr cause this problem. in function vp_interrupt, the isr is 0, and the virtio_balloon pci device have 10 times initerrupts. then IRQ #11 disabled. static irqreturn_t vp_interrupt(int irq, void *opaque) { struct virtio_pci_device *vp_dev = opaque; struct virtio_pci_vq_info *info; irqreturn_t ret = IRQ_NONE; unsigned long flags; u8 isr; /* reading the ISR has the effect of also clearing it so it's very * important to save off the value. 
 */
	isr = ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR);

	/* It's definitely not us if the ISR was not high */
	if (!isr)
		return IRQ_NONE;	// return from here

This implies that io addr values got swapped between the devices. Try lspci -vv in guest
KVM call minutes for Feb 22
Looks like Chris will send minutes too, so I didn't do much to polish this. I didn't realise he was doing it until I had this, so here's the braindump; hope it helps.
1. 0.14 postmortem
   - what went well: wiki for planning, testing
   - what can be improved: rc cycle could be longer
2. 0.15
   - should be end of June; July 1?
   - features: QED; other? virtagent? might be blocked by the rpc issue (discussion followed)
3. virtagent rpc
   - what should virtagent use to talk to the world outside of QEMU? Generalize QMP? Use an external rpc library?
   - do we want to/can we have virtagent out of process? Hard to do well with live migration. Would be the most beneficial if we can make it stateless. Might be good for security but hard to integrate.
4. qcow3
   - there will be a wiki with features we want there
5. google summer of code
   - we only have 3 projects; need more projects/mentors; we have another week
6. switch from gpxe to ipxe
   - needed to get patches applied; we'll likely switch; need some testing; might be possible by 0.15
-- MST
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Wed, Feb 23, 2011 at 10:52:09AM +0530, Krishna Kumar2 wrote: Simon Horman ho...@verge.net.au wrote on 02/22/2011 01:17:09 PM: Hi Simon, I have a few questions about the results below: 1. Are the (%) comparisons between non-mq and mq virtio? Yes - mainline kernel with the transmit-only MQ patch. 2. Was UDP or TCP used? TCP. I had done some initial testing on UDP, but don't have the results now as they are really old. But I will be running it again. 3. What was the transmit size (-m option to netperf)? I didn't use the -m option, so it defaults to 16K. The script does: netperf -t TCP_STREAM -c -C -l 60 -H $SERVER Also, I'm interested to know what the status of these patches is. Are you planning a fresh series? Yes. Michael Tsirkin had wanted to see how the MQ RX patch would look, so I was in the process of getting the two working together. The patch is ready and is being tested. Should I send an RFC patch at this time? Yes, please do. The TX-only patch helped the guest TX path but didn't help host-to-guest much (as tested using TCP_MAERTS from the guest). But with the TX+RX patch, both directions are getting improvements. Also, my hope is that with appropriate queue mapping, we might be able to do away with the heuristics to detect single stream load that the TX-only code needs. Remote testing is still to be done. Others might be able to help here once you post the patch. Thanks, - KK

Changes from rev2:
1. Define (in virtio_net.h) the maximum send txqs, and use it in virtio-net and vhost-net.
2. vi->sq[i] is allocated individually, resulting in cache line aligned sq[0] to sq[n]. Another option was to define 'send_queue' as:
   struct send_queue {
       struct virtqueue *svq;
       struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
   } ____cacheline_aligned_in_smp;
   and to statically allocate 'VIRTIO_MAX_SQ' of those. I hope the submitted method is preferable.
3. Changed the vhost model such that vhost[0] handles RX and vhost[1-MAX] handles TX[0-n].
4.
Further change TX handling such that vhost[0] handles both RX/TX for the single stream case.

Enabling MQ on virtio: when the following options are passed to qemu:
- smp > 1
- vhost=on
- mq=on (new option, default: off)
then #txqueues = #cpus. The #txqueues can be changed by using an optional 'numtxqs' option. e.g. for a smp=4 guest:
vhost=on -> #txqueues = 1
vhost=on,mq=on -> #txqueues = 4
vhost=on,mq=on,numtxqs=2 -> #txqueues = 2
vhost=on,mq=on,numtxqs=8 -> #txqueues = 8

Performance (guest -> local host):
System configuration: Host: 8 Intel Xeon, 8 GB memory. Guest: 4 cpus, 2 GB memory. Test: Each test case runs for 60 secs, sum over three runs (except when the number of netperf sessions is 1, which has 10 runs of 12 secs each). No tuning (default netperf) other than taskset'ing vhosts to cpus 0-3. numtxqs=32 gave the best results though the guest had only 4 vcpus (I haven't tried beyond that).

numtxqs=2, vhosts=3
#sessions  BW%     CPU%    RCPU%   SD%      RSD%
1          4.46    -1.96   .19     -12.50   -6.06
2          4.93    -1.16   2.10    0        -2.38
4          46.17   64.77   33.72   19.51    -2.48
8          47.89   70.00   36.23   41.46    13.35
16         48.97   80.44   40.67   21.11    -5.46
24         49.03   78.78   41.22   20.51    -4.78
32         51.11   77.15   42.42   15.81    -6.87
40         51.60   71.65   42.43   9.75     -8.94
48         50.10   69.55   42.85   11.80    -5.81
64         46.24   68.42   42.67   14.18    -3.28
80         46.37   63.13   41.62   7.43     -6.73
96         46.40   63.31   42.20   9.36     -4.78
128        50.43   62.79   42.16   13.11    -1.23
BW: 37.2%, CPU/RCPU: 66.3%,41.6%, SD/RSD: 11.5%,-3.7%

numtxqs=8, vhosts=5
#sessions  BW%     CPU%    RCPU%   SD%      RSD%
1          -.76    -1.56   2.33    0        3.03
2          17.41   11.11   11.41   0        -4.76
4          42.12   55.11   30.20   19.51    .62
8          54.69   80.00   39.22   24.39    -3.88
16         54.77   81.62   40.89   20.34    -6.58
24         54.66   79.68   41.57   15.49    -8.99
32         54.92   76.82   41.79   17.59    -5.70
40
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Wed, Feb 23, 2011 at 02:59:04PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 08:19:21 Alex Williamson wrote: On Sun, 2011-01-30 at 13:11 +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Looks pretty good overall. A few comments below. It seems like we should be able to hook this into vfio with a small stub in kvm. We just need to be able to communicate disabling and enabling of individual msix vectors. For brute force, we could do this with a lot of eventfds, triggered by kvm and consumed by vfio, two per MSI-X vector. Not sure if there's something smaller that could do it. Thanks, Alex, thanks for your comments. See my comments below: Alex Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/Makefile|2 +- arch/x86/kvm/x86.c |8 +- include/linux/kvm.h | 21 include/linux/kvm_host.h | 25 virt/kvm/assigned-dev.c | 44 +++ virt/kvm/kvm_main.c | 38 ++- virt/kvm/msix_mmio.c | 286 ++ virt/kvm/msix_mmio.h | 25 8 files changed, 442 insertions(+), 7 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
	coalesced_mmio.o irq_comm.o eventfd.o \
-	assigned-dev.o)
+	assigned-dev.o msix_mmio.o)
kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o)
kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fa708c9..89bf12c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
 	case KVM_CAP_ASYNC_PF:
+	case KVM_CAP_MSIX_MMIO:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa;
+	int r;
 	gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception);
@@ -3822,14 +3824,16 @@ static int emulator_write_emulated_onepage(unsigned long addr,
 mmio:
 	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val);
+	r = vcpu_mmio_write(vcpu, gpa, bytes, val);
 	/*
 	 * Is this MMIO handled locally?
 	 */
-	if (!vcpu_mmio_write(vcpu, gpa, bytes, val))
+	if (!r)
 		return X86EMUL_CONTINUE;
 	vcpu->mmio_needed = 1;
-	vcpu->run->exit_reason = KVM_EXIT_MMIO;
+	vcpu->run->exit_reason = (r == -ENOTSYNC) ?
+		KVM_EXIT_MSIX_ROUTING_UPDATE : KVM_EXIT_MMIO;

This use of -ENOTSYNC is IMO confusing. How about we make vcpu_mmio_write return the positive exit reason? Negative value will mean an error.
 	vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa;
 	vcpu->run->mmio.len = vcpu->mmio_size = bytes;
 	vcpu->run->mmio.is_write = vcpu->mmio_is_write = 1;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..ad9df4b 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -161,6 +161,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_NMI 16
 #define KVM_EXIT_INTERNAL_ERROR 17
 #define KVM_EXIT_OSI 18
+#define KVM_EXIT_MSIX_ROUTING_UPDATE 19
 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
@@ -541,6 +542,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_MSIX_MMIO 60
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -672,6 +674,9 @@ struct kvm_clock_data {
 #define KVM_XEN_HVM_CONFIG _IOW(KVMIO, 0x7a, struct kvm_xen_hvm_config)
 #define KVM_SET_CLOCK _IOW(KVMIO, 0x7b, struct kvm_clock_data)
 #define KVM_GET_CLOCK _IOR(KVMIO, 0x7c, struct kvm_clock_data)
+/* Available with KVM_CAP_MSIX_MMIO */
+#define KVM_REGISTER_MSIX_MMIO _IOW(KVMIO, 0x7d, struct kvm_msix_mmio_user)
+#define KVM_UNREGISTER_MSIX_MMIO _IOW(KVMIO, 0x7e, struct kvm_msix_mmio_user)
 /* Available with KVM_CAP_PIT_STATE2 */
 #define KVM_GET_PIT2 _IOR(KVMIO, 0x9f, struct kvm_pit_state2)
 #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Wed, Feb 23, 2011 at 12:18:36PM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 12:09:15 PM: Hi Michael, Yes. Michael Tsirkin had wanted to see how the MQ RX patch would look like, so I was in the process of getting the two working together. The patch is ready and is being tested. Should I send a RFC patch at this time? Yes, please do. Sure, will get a build/test on latest bits and send in 1-2 days. The TX-only patch helped the guest TX path but didn't help host-guest much (as tested using TCP_MAERTS from the guest). But with the TX+RX patch, both directions are getting improvements. Also, my hope is that with appropriate queue mapping, we might be able to do away with heuristics to detect single stream load that TX only code needs. Yes, that whole stuff is removed, and the TX/RX path is unchanged with this patch (thankfully :) Cool. I was wondering whether in that case, we can do without host kernel changes at all, and use a separate fd for each TX/RX pair. The advantage of that approach is that this way, the max fd limit naturally sets an upper bound on the amount of resources userspace can use up. Thoughts? In any case, pls don't let the above delay sending an RFC. Remote testing is still to be done. Others might be able to help here once you post the patch. That's great, will appreciate any help. Thanks, - KK
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Wed, Feb 23, 2011 at 09:34:19AM -0700, Alex Williamson wrote: On Wed, 2011-02-23 at 14:59 +0800, Sheng Yang wrote: On Wednesday 23 February 2011 08:19:21 Alex Williamson wrote: On Sun, 2011-01-30 at 13:11 +0800, Sheng Yang wrote:

+static int msix_table_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
+				void *val)
+{
+	struct kvm_msix_mmio_dev *mmio_dev =
+		container_of(this, struct kvm_msix_mmio_dev, table_dev);
+	struct kvm_msix_mmio *mmio;
+	int idx, ret = 0, entry, offset, r;
+
+	mutex_lock(&mmio_dev->lock);
+	idx = get_mmio_table_index(mmio_dev, addr, len);
+	if (idx < 0) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+	if ((addr & 0x3) || (len != 4 && len != 8))
+		goto out;
+
+	offset = addr & 0xf;
+	if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL && len == 8)
+		goto out;
+
+	mmio = &mmio_dev->mmio[idx];
+	entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
+	r = copy_from_user(val, (void __user *)(mmio->table_base_va +
+			entry * PCI_MSIX_ENTRY_SIZE + offset), len);
+	if (r)
+		goto out;
+out:
+	mutex_unlock(&mmio_dev->lock);
+	return ret;
+}
+
+static int msix_table_mmio_write(struct kvm_io_device *this, gpa_t addr,
+				 int len, const void *val)
+{
+	struct kvm_msix_mmio_dev *mmio_dev =
+		container_of(this, struct kvm_msix_mmio_dev, table_dev);
+	struct kvm_msix_mmio *mmio;
+	int idx, entry, offset, ret = 0, r = 0;
+	gpa_t entry_base;
+	u32 old_ctrl, new_ctrl;
+	u32 *ctrl_pos;
+
+	mutex_lock(&mmio_dev->kvm->lock);
+	mutex_lock(&mmio_dev->lock);
+	idx = get_mmio_table_index(mmio_dev, addr, len);
+	if (idx < 0) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+	if ((addr & 0x3) || (len != 4 && len != 8))
+		goto out;
+
+	offset = addr & 0xF;

nit, this is 0xf above.

So you like another mask?

I'm just noting that above you used 0xf and here you used 0xF.
+	if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL && len == 8)
+		goto out;
+
+	mmio = &mmio_dev->mmio[idx];
+	entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
+	entry_base = mmio->table_base_va + entry * PCI_MSIX_ENTRY_SIZE;
+	ctrl_pos = (u32 *)(entry_base + PCI_MSIX_ENTRY_VECTOR_CTRL);
+
+	if (get_user(old_ctrl, ctrl_pos))
+		goto out;
+
+	/* Don't allow writing to other fields when entry is unmasked */
+	if (!(old_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT) &&
+	    offset != PCI_MSIX_ENTRY_VECTOR_CTRL)
+		goto out;
+
+	if (copy_to_user((void __user *)(entry_base + offset), val, len))
+		goto out;
+
+	if (get_user(new_ctrl, ctrl_pos))
+		goto out;
+
+	if ((offset < PCI_MSIX_ENTRY_VECTOR_CTRL && len == 4) ||
+	    (offset < PCI_MSIX_ENTRY_DATA && len == 8))
+		ret = -ENOTSYNC;

Couldn't we avoid the above get_user(new_ctrl,...) if we tested this first? I'd also prefer to see a direct goto out here, since it's not entirely obvious that this first 'if' always falls into the below 'if'. I'm not a fan of using an error code as a special non-error return either.

In fact when offset == PCI_MSIX_ENTRY_DATA and len == 8, we need to check new_ctrl to see if they indeed modified ctrl bits. I would try to make the logic here more clear.

Isn't that what you're testing for though? (offset < PCI_MSIX_ENTRY_VECTOR_CTRL && len == 4) If this is true, we can only be writing address, upper address, or data, not control. (offset < PCI_MSIX_ENTRY_DATA && len == 8) If this is true, we're writing address and upper address, not the control vector. Additionally, I think the size and alignment test should be more restrictive. We really only want to allow naturally aligned 4 or 8 byte accesses.
Maybe instead of:

	if ((addr & 0x3) || (len != 4 && len != 8))

we should do this:

	if (!(len == 4 || len == 8) || addr & (len - 1))

Then the test could become:

	if (offset + len <= PCI_MSIX_ENTRY_VECTOR_CTRL)

Thanks, Alex

Or rather:

	if (offset + len != PCI_MSIX_ENTRY_VECTOR_CTRL + 4)

We are matching offset + length to control offset + control length, and avoid depending on the fact that control is the last register.
-- MST
Re: [ISC-Bugs #22806] [PATCH] lfp: fix AF_INET checksum with csum offloading
No answer from isc so far. I think it's a problem that the popular dhclient has had this bug for so long. Does anyone know what else can be done? On Mon, Dec 27, 2010 at 07:16:32PM +0200, Michael S. Tsirkin wrote: When an AF_INET socket is used, linux would sometimes return a packet without the checksum. This happens when a packet originates on the same machine, which is most common with virtual machines but is possible even without them, with a software-based dhcp server such as dnsmasq. This appears to be a performance optimization to avoid calculating the checksum, and there's no way to disable it. Users are required to detect such packets by passing a msg_control parameter to the recvmsg call. This was added in linux kernel 2.6.21. dhclient from isc.org as of 4.2.0-P2 fails to do this and consequently discards such packets, so such setups fail to get an IP. This was reported in the past: https://lists.isc.org/mailman/htdig/dhcp-hackers/2010-April/001825.html The attached patch fixes this by passing msg_control and looking at the tp_status field in the result. If the TP_STATUS_CSUMNOTREADY bit is set, the packet checksum is missing because the packet is local. The patch below is currently successfully used at least on Fedora and some other Red Hat based distributions. Applying this to the ISC sources would help make the problem go away for everyone. --- Notes: - The patch is to be applied with -p1. - The patch below applies to dhcp 4.2.0-P2 and 4.1.2 (the latter with an offset). - Please Cc me on comments directly, I am not subscribed to the dhcp-hackers list.
common/bpf.c     |  2 +-
common/dlpi.c    |  2 +-
common/lpf.c     | 81 ++
common/nit.c     |  2 +-
common/packet.c  |  4 +-
common/upf.c     |  2 +-
includes/dhcpd.h |  2 +-
7 files changed, 70 insertions(+), 25 deletions(-)

diff --git a/common/bpf.c b/common/bpf.c
index b0ef657..8bd5727 100644
--- a/common/bpf.c
+++ b/common/bpf.c
@@ -485,7 +485,7 @@ ssize_t receive_packet (interface, buf, len, from, hfrom)
 	offset = decode_udp_ip_header (interface,
 				       interface->rbuf, interface->rbuf_offset,
-				       from, hdr.bh_caplen, paylen);
+				       from, hdr.bh_caplen, paylen, 0);
 
 	/* If the IP or UDP checksum was bad, skip the packet... */
 	if (offset < 0) {
diff --git a/common/dlpi.c b/common/dlpi.c
index eb64342..d4a8bb9 100644
--- a/common/dlpi.c
+++ b/common/dlpi.c
@@ -694,7 +694,7 @@ ssize_t receive_packet (interface, buf, len, from, hfrom)
 	length -= offset;
 #endif
 	offset = decode_udp_ip_header (interface, dbuf, bufix,
-				       from, length, paylen);
+				       from, length, paylen, 0);
 
 	/*
 	 * If the IP or UDP checksum was bad, skip the packet...
diff --git a/common/lpf.c b/common/lpf.c
index f727b7c..4bdb0f1 100644
--- a/common/lpf.c
+++ b/common/lpf.c
@@ -29,18 +29,33 @@
 #include "dhcpd.h"
 #if defined (USE_LPF_SEND) || defined (USE_LPF_RECEIVE)
 #include <sys/ioctl.h>
+#include <sys/socket.h>
 #include <sys/uio.h>
 #include <errno.h>
 #include <asm/types.h>
 #include <linux/filter.h>
 #include <linux/if_ether.h>
+#include <linux/if_packet.h>
 #include <netinet/in_systm.h>
 #include "includes/netinet/ip.h"
 #include "includes/netinet/udp.h"
 #include "includes/netinet/if_ether.h"
 #include <net/if.h>
 
+#ifndef PACKET_AUXDATA
+#define PACKET_AUXDATA 8
+
+struct tpacket_auxdata
+{
+	__u32 tp_status;
+	__u32 tp_len;
+	__u32 tp_snaplen;
+	__u16 tp_mac;
+	__u16 tp_net;
+};
+#endif
+
 /* Reinitializes the specified interface after an address change. This is not required for packet-filter APIs.
 */
@@ -66,10 +81,14 @@ int if_register_lpf (info)
 	struct interface_info *info;
 {
 	int sock;
-	struct sockaddr sa;
+	union {
+		struct sockaddr_ll ll;
+		struct sockaddr common;
+	} sa;
+	struct ifreq ifr;
 
 	/* Make an LPF socket. */
-	if ((sock = socket(PF_PACKET, SOCK_PACKET,
+	if ((sock = socket(PF_PACKET, SOCK_RAW,
 	    htons((short)ETH_P_ALL))) < 0) {
 		if (errno == ENOPROTOOPT || errno == EPROTONOSUPPORT ||
 		    errno == ESOCKTNOSUPPORT || errno == EPFNOSUPPORT ||
@@ -84,11 +103,16 @@ int if_register_lpf (info)
 		log_fatal ("Open a socket for LPF: %m");
 	}
 
+	memset (&ifr, 0, sizeof ifr);
+	strncpy (ifr.ifr_name, (const char *)info->ifp, sizeof ifr.ifr_name);
+	if (ioctl (sock, SIOCGIFINDEX, &ifr))
+		log_fatal ("Failed to get interface index: %m");
+
 	/* Bind
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Thu, Feb 24, 2011 at 04:08:22PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 16:45:37 Michael S. Tsirkin wrote: On Wed, Feb 23, 2011 at 02:59:04PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 08:19:21 Alex Williamson wrote: On Sun, 2011-01-30 at 13:11 +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Looks pretty good overall. A few comments below. It seems like we should be able to hook this into vfio with a small stub in kvm. We just need to be able to communicate disabling and enabling of individual msix vectors. For brute force, we could do this with a lot of eventfds, triggered by kvm and consumed by vfio, two per MSI-X vector. Not sure if there's something smaller that could do it. Thanks, Alex, thanks for your comments. See my comments below: Alex Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/Makefile|2 +- arch/x86/kvm/x86.c |8 +- include/linux/kvm.h | 21 include/linux/kvm_host.h | 25 virt/kvm/assigned-dev.c | 44 +++ virt/kvm/kvm_main.c | 38 ++- virt/kvm/msix_mmio.c | 286 ++ virt/kvm/msix_mmio.h | 25 8 files changed, 442 insertions(+), 7 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
	coalesced_mmio.o irq_comm.o eventfd.o \
-	assigned-dev.o)
+	assigned-dev.o msix_mmio.o)
kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o)
kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fa708c9..89bf12c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
 	case KVM_CAP_ASYNC_PF:
+	case KVM_CAP_MSIX_MMIO:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa;
+	int r;
 	gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception);
@@ -3822,14 +3824,16 @@ static int emulator_write_emulated_onepage(unsigned long addr,
 mmio:
 	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val);
+	r = vcpu_mmio_write(vcpu, gpa, bytes, val);
 	/*
 	 * Is this MMIO handled locally?
 	 */
-	if (!vcpu_mmio_write(vcpu, gpa, bytes, val))
+	if (!r)
 		return X86EMUL_CONTINUE;
 	vcpu->mmio_needed = 1;
-	vcpu->run->exit_reason = KVM_EXIT_MMIO;
+	vcpu->run->exit_reason = (r == -ENOTSYNC) ?
+		KVM_EXIT_MSIX_ROUTING_UPDATE : KVM_EXIT_MMIO;

This use of -ENOTSYNC is IMO confusing. How about we make vcpu_mmio_write return the positive exit reason? Negative value will mean an error.

Make sense. I would update it.
 	vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa;
 	vcpu->run->mmio.len = vcpu->mmio_size = bytes;
 	vcpu->run->mmio.is_write = vcpu->mmio_is_write = 1;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..ad9df4b 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -161,6 +161,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_NMI 16
 #define KVM_EXIT_INTERNAL_ERROR 17
 #define KVM_EXIT_OSI 18
+#define KVM_EXIT_MSIX_ROUTING_UPDATE 19
 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
@@ -541,6 +542,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_MSIX_MMIO 60
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -672,6 +674,9 @@ struct kvm_clock_data {
 #define KVM_XEN_HVM_CONFIG _IOW(KVMIO, 0x7a, struct kvm_xen_hvm_config
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Thu, Feb 24, 2011 at 05:44:20PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 16:45:37 Michael S. Tsirkin wrote: On Wed, Feb 23, 2011 at 02:59:04PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 08:19:21 Alex Williamson wrote: On Sun, 2011-01-30 at 13:11 +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Looks pretty good overall. A few comments below. It seems like we should be able to hook this into vfio with a small stub in kvm. We just need to be able to communicate disabling and enabling of individual msix vectors. For brute force, we could do this with a lot of eventfds, triggered by kvm and consumed by vfio, two per MSI-X vector. Not sure if there's something smaller that could do it. Thanks, Alex, thanks for your comments. See my comments below: Alex Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/Makefile|2 +- arch/x86/kvm/x86.c |8 +- include/linux/kvm.h | 21 include/linux/kvm_host.h | 25 virt/kvm/assigned-dev.c | 44 +++ virt/kvm/kvm_main.c | 38 ++- virt/kvm/msix_mmio.c | 286 ++ virt/kvm/msix_mmio.h | 25 8 files changed, 442 insertions(+), 7 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..89bf12c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu) { gpa_t gpa; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3822,14 +3824,16 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val); /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val)) + if (!r) return X86EMUL_CONTINUE; vcpu-mmio_needed = 1; - vcpu-run-exit_reason = KVM_EXIT_MMIO; + vcpu-run-exit_reason = (r == -ENOTSYNC) ? + KVM_EXIT_MSIX_ROUTING_UPDATE : KVM_EXIT_MMIO; This use of -ENOTSYNC is IMO confusing. How about we make vcpu_mmio_write return the positive exit reason? Negative value will mean an error. In fact currently nagative value means something more need to be done, the same as MMIO exit. So it would be if (!r) return X86EMUL_CONTINUE; vcpu-run-exit_reason = r; Now I think we can keep it, or update them all later. The way to do this would be 1. patch to return KVM_EXIT_MMIO on mmio 2. 
your patch that returns KVM_EXIT_MSIX_ROUTING_UPDATE on top -- regards Yang, Sheng -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] KVM: Add kvm_io_ext_data to IO handler
On Thu, Feb 24, 2011 at 05:51:03PM +0800, Sheng Yang wrote: Add a new parameter to IO writing handler, so that we can transfer information from IO handler to caller. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/i8254.c |6 -- arch/x86/kvm/i8259.c |3 ++- arch/x86/kvm/lapic.c |3 ++- arch/x86/kvm/x86.c| 13 - include/linux/kvm_host.h | 12 ++-- virt/kvm/coalesced_mmio.c |3 ++- virt/kvm/eventfd.c|2 +- virt/kvm/ioapic.c |2 +- virt/kvm/iodev.h |6 -- virt/kvm/kvm_main.c |4 ++-- 10 files changed, 36 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c index efad723..bd8f0c5 100644 --- a/arch/x86/kvm/i8254.c +++ b/arch/x86/kvm/i8254.c @@ -439,7 +439,8 @@ static inline int pit_in_range(gpa_t addr) } static int pit_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = dev_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; @@ -585,7 +586,8 @@ static int pit_ioport_read(struct kvm_io_device *this, } static int speaker_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = speaker_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 3cece05..96b1070 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -480,7 +480,8 @@ static inline struct kvm_pic *to_pic(struct kvm_io_device *dev) } static int picdev_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *val) + gpa_t addr, int len, const void *val, + struct kvm_io_ext_data *ext_data) { struct kvm_pic *s = to_pic(this); unsigned char data = *(unsigned char *)val; diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 93cf9d0..f413e9c 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -836,7 
+836,8 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) } static int apic_mmio_write(struct kvm_io_device *this, - gpa_t address, int len, const void *data) + gpa_t address, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_lapic *apic = to_lapic(this); unsigned int offset = address - apic-base_address; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..21b84e2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3571,13 +3571,14 @@ static void kvm_init_msr_list(void) } static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len, -const void *v) +const void *v, struct kvm_io_ext_data *ext_data) { if (vcpu-arch.apic - !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v)) + !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v, ext_data)) return 0; - return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, addr, len, v); + return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, + addr, len, v, ext_data); } static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v) @@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu) { gpa_t gpa; + struct kvm_io_ext_data ext_data; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3825,7 +3827,7 @@ mmio: /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val)) + if (!vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data)) return X86EMUL_CONTINUE; vcpu-mmio_needed = 1; @@ -3940,6 +3942,7 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd) { /* TODO: String I/O for in kernel device */ int r; + struct kvm_io_ext_data ext_data; if (vcpu-arch.pio.in) r = kvm_io_bus_read(vcpu-kvm, KVM_PIO_BUS, vcpu-arch.pio.port, @@ -3947,7 +3950,7 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd) else r = kvm_io_bus_write(vcpu-kvm, KVM_PIO_BUS, vcpu-arch.pio.port, vcpu-arch.pio.size, - pd); + pd, ext_data); return r; } diff
Re: [PATCH 3/4] KVM: Emulate MSI-X table in kernel
On Thu, Feb 24, 2011 at 05:51:04PM +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Signed-off-by: Sheng Yang sh...@linux.intel.com Doesn't look like all comments got addressed. E.g. gpa_t entry_base is still there and in reality you said it's a host virtual address so should be void __user *; And ENOTSYNC meaning 'MSIX' is pretty hacky. --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/Makefile |2 +- arch/x86/kvm/mmu.c |2 + arch/x86/kvm/x86.c | 40 - include/linux/kvm.h | 28 include/linux/kvm_host.h| 34 + virt/kvm/assigned-dev.c | 44 ++ virt/kvm/kvm_main.c | 38 +- virt/kvm/msix_mmio.c| 296 +++ virt/kvm/msix_mmio.h| 25 10 files changed, 497 insertions(+), 13 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index aa75f21..4a390a4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -635,6 +635,7 @@ enum emulation_result { EMULATE_DONE, /* no further processing */ EMULATE_DO_MMIO, /* kvm_run filled with mmio request */ EMULATE_FAIL, /* can't emulate this instruction */ + EMULATE_USERSPACE_EXIT, /* we need exit to userspace */ }; #define EMULTYPE_NO_DECODE (1 0) diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9cafbb4..912dca4 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3358,6 +3358,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code, case EMULATE_DO_MMIO: ++vcpu-stat.mmio_exits; /* fall through */ + case EMULATE_USERSPACE_EXIT: + /* fall through */ case EMULATE_FAIL: return 0; default: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 21b84e2..87308eb 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3809,6 +3810,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, { gpa_t gpa; struct kvm_io_ext_data ext_data; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3824,18 +3826,32 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data); /* * Is this MMIO handled locally? 
*/ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data)) + if (!r) return X86EMUL_CONTINUE; - vcpu-mmio_needed = 1; - vcpu-run-exit_reason = KVM_EXIT_MMIO; - vcpu-run-mmio.phys_addr = vcpu-mmio_phys_addr = gpa; - vcpu-run-mmio.len = vcpu-mmio_size = bytes; - vcpu-run-mmio.is_write = vcpu-mmio_is_write = 1; - memcpy(vcpu-run-mmio.data, val, bytes); + if (r == -ENOTSYNC) { + vcpu-userspace_exit_needed = 1; + vcpu-run-exit_reason = KVM_EXIT_MSIX_ROUTING_UPDATE; + vcpu-run-msix_routing.dev_id = + ext_data.msix_routing.dev_id; + vcpu-run-msix_routing.type = + ext_data.msix_routing.type; + vcpu-run-msix_routing.entry_idx = + ext_data.msix_routing.entry_idx; + vcpu-run-msix_routing.flags = + ext_data.msix_routing.flags; + } else { + vcpu-mmio_needed = 1; + vcpu-run-exit_reason = KVM_EXIT_MMIO; + vcpu-run-mmio.phys_addr = vcpu-mmio_phys_addr = gpa; + vcpu-run-mmio.len = vcpu-mmio_size = bytes; + vcpu-run-mmio.is_write = vcpu-mmio_is_write = 1; + memcpy(vcpu-run-mmio.data, val, bytes); + } return X86EMUL_CONTINUE; } @@ -4469,6 +4485,8 @@ done:
Re: [PATCH 2/4] KVM: Add kvm_io_ext_data to IO handler
On Fri, Feb 25, 2011 at 11:23:30AM +0800, Sheng Yang wrote: On Thursday 24 February 2011 18:22:19 Michael S. Tsirkin wrote: On Thu, Feb 24, 2011 at 05:51:03PM +0800, Sheng Yang wrote: Add a new parameter to IO writing handler, so that we can transfer information from IO handler to caller. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/i8254.c |6 -- arch/x86/kvm/i8259.c |3 ++- arch/x86/kvm/lapic.c |3 ++- arch/x86/kvm/x86.c| 13 - include/linux/kvm_host.h | 12 ++-- virt/kvm/coalesced_mmio.c |3 ++- virt/kvm/eventfd.c|2 +- virt/kvm/ioapic.c |2 +- virt/kvm/iodev.h |6 -- virt/kvm/kvm_main.c |4 ++-- 10 files changed, 36 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c index efad723..bd8f0c5 100644 --- a/arch/x86/kvm/i8254.c +++ b/arch/x86/kvm/i8254.c @@ -439,7 +439,8 @@ static inline int pit_in_range(gpa_t addr) } static int pit_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = dev_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; @@ -585,7 +586,8 @@ static int pit_ioport_read(struct kvm_io_device *this, } static int speaker_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = speaker_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 3cece05..96b1070 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -480,7 +480,8 @@ static inline struct kvm_pic *to_pic(struct kvm_io_device *dev) } static int picdev_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *val) + gpa_t addr, int len, const void *val, + struct kvm_io_ext_data *ext_data) { struct kvm_pic *s = to_pic(this); unsigned char data = *(unsigned char *)val; diff --git 
a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 93cf9d0..f413e9c 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -836,7 +836,8 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) } static int apic_mmio_write(struct kvm_io_device *this, - gpa_t address, int len, const void *data) + gpa_t address, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_lapic *apic = to_lapic(this); unsigned int offset = address - apic-base_address; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..21b84e2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3571,13 +3571,14 @@ static void kvm_init_msr_list(void) } static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len, -const void *v) +const void *v, struct kvm_io_ext_data *ext_data) { if (vcpu-arch.apic - !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v)) + !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v, ext_data)) return 0; - return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, addr, len, v); + return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, + addr, len, v, ext_data); } static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v) @@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu) { gpa_t gpa; + struct kvm_io_ext_data ext_data; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3825,7 +3827,7 @@ mmio: /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val)) + if (!vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data)) return X86EMUL_CONTINUE; vcpu-mmio_needed = 1; @@ -3940,6 +3942,7 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd) { /* TODO: String I/O for in kernel device */ int r; + struct kvm_io_ext_data ext_data; if (vcpu-arch.pio.in) r = kvm_io_bus_read(vcpu-kvm, KVM_PIO_BUS, vcpu
Re: [PATCH 3/4] KVM: Emulate MSI-X table in kernel
On Fri, Feb 25, 2011 at 02:28:02PM +0800, Sheng Yang wrote: On Thursday 24 February 2011 18:45:08 Michael S. Tsirkin wrote: On Thu, Feb 24, 2011 at 05:51:04PM +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Signed-off-by: Sheng Yang sh...@linux.intel.com Doesn't look like all comments got addressed. E.g. gpa_t entry_base is still there and in reality you said it's a host virtual address so should be void __user *; Would update it. And ENOTSYNC meaning 'MSIX' is pretty hacky. I'd like to discuss it later. We may need some work on all MMIO handling side to make it more straightforward. But I don't want to bundle it with this one... It's not PCI related so I'll defer to Avi/Marcelo on this. Are you guys happy with the ENOTSYNC meaning 'MSIX' and userspace_exit_needed hacks in this code? --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/Makefile |2 +- arch/x86/kvm/mmu.c |2 + arch/x86/kvm/x86.c | 40 - include/linux/kvm.h | 28 include/linux/kvm_host.h| 34 + virt/kvm/assigned-dev.c | 44 ++ virt/kvm/kvm_main.c | 38 +- virt/kvm/msix_mmio.c| 296 +++ virt/kvm/msix_mmio.h | 25 10 files changed, 497 insertions(+), 13 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index aa75f21..4a390a4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -635,6 +635,7 @@ enum emulation_result { EMULATE_DONE, /* no further processing */ EMULATE_DO_MMIO, /* kvm_run filled with mmio request */ EMULATE_FAIL, /* can't emulate this instruction */ + EMULATE_USERSPACE_EXIT, /* we need exit to userspace */ }; #define EMULTYPE_NO_DECODE (1 0) diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9cafbb4..912dca4 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3358,6 +3358,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code, case EMULATE_DO_MMIO: ++vcpu-stat.mmio_exits; /* fall through */ + case EMULATE_USERSPACE_EXIT: + /* fall through */ case EMULATE_FAIL: return 0; default: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 21b84e2..87308eb 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3809,6 +3810,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, { gpa_t gpa; struct kvm_io_ext_data ext_data; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3824,18 +3826,32 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data); /* * Is this MMIO handled locally? 
*/ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data)) + if (!r) return X86EMUL_CONTINUE; - vcpu-mmio_needed = 1; - vcpu-run-exit_reason = KVM_EXIT_MMIO; - vcpu-run-mmio.phys_addr = vcpu-mmio_phys_addr = gpa; - vcpu-run-mmio.len = vcpu-mmio_size = bytes; - vcpu-run-mmio.is_write = vcpu-mmio_is_write = 1; - memcpy(vcpu-run-mmio.data, val, bytes); + if (r == -ENOTSYNC) { + vcpu-userspace_exit_needed = 1; + vcpu-run-exit_reason = KVM_EXIT_MSIX_ROUTING_UPDATE; + vcpu-run-msix_routing.dev_id = + ext_data.msix_routing.dev_id; + vcpu-run-msix_routing.type = + ext_data.msix_routing.type; + vcpu-run
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Fri, Feb 25, 2011 at 02:12:42PM +0800, Sheng Yang wrote: On Thursday 24 February 2011 18:17:34 Michael S. Tsirkin wrote: On Thu, Feb 24, 2011 at 05:44:20PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 16:45:37 Michael S. Tsirkin wrote: On Wed, Feb 23, 2011 at 02:59:04PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 08:19:21 Alex Williamson wrote: On Sun, 2011-01-30 at 13:11 +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Looks pretty good overall. A few comments below. It seems like we should be able to hook this into vfio with a small stub in kvm. We just need to be able to communicate disabling and enabling of individual msix vectors. For brute force, we could do this with a lot of eventfds, triggered by kvm and consumed by vfio, two per MSI-X vector. Not sure if there's something smaller that could do it. Thanks, Alex, thanks for your comments. See my comments below: Alex Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/Makefile|2 +- arch/x86/kvm/x86.c |8 +- include/linux/kvm.h | 21 include/linux/kvm_host.h | 25 virt/kvm/assigned-dev.c | 44 +++ virt/kvm/kvm_main.c | 38 ++- virt/kvm/msix_mmio.c | 286 ++ virt/kvm/msix_mmio.h | 25 8 files changed, 442 insertions(+), 7 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..89bf12c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu) { gpa_t gpa; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3822,14 +3824,16 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val); /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val)) + if (!r) return X86EMUL_CONTINUE; vcpu-mmio_needed = 1; - vcpu-run-exit_reason = KVM_EXIT_MMIO; + vcpu-run-exit_reason = (r == -ENOTSYNC) ? + KVM_EXIT_MSIX_ROUTING_UPDATE : KVM_EXIT_MMIO; This use of -ENOTSYNC is IMO confusing. How about we make vcpu_mmio_write return the positive exit reason? Negative value will mean an error. In fact currently nagative value means something more need to be done, the same as MMIO exit. So it would be if (!r) return X86EMUL_CONTINUE; vcpu-run-exit_reason = r; Now I think we can keep it, or update them all later. The way to do this would be 1. patch to return KVM_EXIT_MMIO on mmio 2. your patch that returns KVM_EXIT_MSIX_ROUTING_UPDATE on top It's not that straightforward. In most condition, the reason vcpu_mmio_write() 0 because KVM itself unable to complete the request. That's quite straightforward. 
But each handler in the chain can't decide that it would be KVM_EXIT_MMIO; they only know when all of them fail to handle the access. Well, this just violates the standard API for return values. The standard semantics are: negative means error, positive means not an error. So you can just return
Re: Bug in kvm_set_irq
On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote: Hi, Each time i try to use vhost_net, i'm facing a kernel bug. I do a modprobe vhost_net, and start the guest with vhost=on. Following is a trace with kernel 2.6.37, but i had the same problem with 2.6.36 (cf https://lkml.org/lkml/2010/11/30/29). 2.6.36 had a theoretical race that could explain this, but it should be ok in 2.6.37. The bug only occurs with vhost_net loaded, so i don't know if this is a bug in the kvm module code or in the vhost_net code. It could be a bug in eventfd, which is the interface used by both kvm and vhost_net. Just for fun, you can try 2.6.38 - eventfd code has been changed a lot in 2.6.38, and if it does not trigger there it's a hint that irqfd is the reason.
Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243100] BUG: unable to handle kernel paging request at 2458
Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243250] IP: [a041aa8a] kvm_set_irq+0x2a/0x130 [kvm]
Could you run markup_oops/ksymoops on this please? As far as I can see, kvm_set_irq can only get a wrong kvm pointer, unless there's some general memory corruption, I'd guess. You can also try comparing the irqfd->kvm pointer in kvm_irqfd_assign, irqfd_wakeup and kvm_set_irq in virt/kvm/eventfd.c.
Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243378] PGD 45d363067 PUD 45e77a067 PMD 0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243556] Oops: [#1] SMP Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243692] last sysfs file: /sys/devices/pci:00/:00:0d.0/:05:00.0/:06:00.0/irq Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243777] CPU 0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243820] Modules linked in: vhost_net macvtap macvlan tun powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand fre q_table cpufreq_conservative fuse xt_physdev ip6t_LOG ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp nf_conntrack_ftp nf_connt rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore snd_page_alloc tpm_tis tpm ps mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses sd_mod enclosu re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2 scsi_mod ehci_hcd [last unloaded: scsi_wait_scan] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Pid: 10, comm: kworker/0:1 Not tainted 2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RIP: 0010:[a041aa8a] [a041aa8a] kvm_set_irq+0x2a/0x130 [kvm] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RSP: 0018:88045fc89d30 EFLAGS: 00010246 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RAX: RBX: 001a RCX: 0001 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RDX: RSI: RDI: Feb 23 13:56:19 
ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RBP: R08: 0001 R09: 880856a91e48 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] R10: R11: R12: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] R13: 0001 R14: R15: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] FS: 7f617986c710() GS:88007f80() knlGS: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] CS: 0010 DS: ES: CR0: 8005003b Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] CR2: 2458 CR3: 00045d197000 CR4: 06f0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] DR0: DR1: DR2: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] DR3: DR6: 0ff0 DR7: 0400 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Process kworker/0:1 (pid: 10, threadinfo 88045fc88000, task 88085fc53c30) Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Stack: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] 88045fc89fd8 000119c0 88045fc88010 88085fc53ee8 Feb 23 13:56:19
Re: [PATCH 0/3] [RFC] Implement multiqueue (RX TX) virtio-net
On Mon, Feb 28, 2011 at 12:04:27PM +0530, Krishna Kumar wrote: This patch series is a continuation of an earlier one that implemented guest MQ TX functionality. This new patchset implements both RX and TX MQ. Qemu changes are not being included at this time solely to aid in easier review. Compatibility testing with old/new combinations of qemu/guest and vhost was done without any issues. Some early TCP/UDP test results are at the bottom of this post, I plan to submit more test results in the coming days. Please review and provide feedback on what can improve. Thanks! Signed-off-by: Krishna Kumar krkum...@in.ibm.com To help testing, could you post the qemu changes separately please? --- Test configuration: Host: 8 Intel Xeon, 8 GB memory Guest: 4 cpus, 2 GB memory Each test case runs for 60 secs, results below are average over two runs. Bandwidth numbers are in gbps. I have used default netperf, and no testing/system tuning other than taskset each vhost to 0xf (cpus 0-3). Comparison is testing original kernel vs new kernel with #txqs=8 (# refers to number of netperf sessions). 
___________________________________________________________
TCP: Guest -> Local Host (TCP_STREAM)
#     BW1    BW2 (%)       SD1    SD2 (%)       RSD1   RSD2 (%)
___________________________________________________________
1     7190   7170 (-.2)    0      0 (0)         3      4 (33.3)
2     8774   11235 (28.0)  3      3 (0)         16     14 (-12.5)
4     9753   15195 (55.7)  17     21 (23.5)     65     59 (-9.2)
8     10224  18265 (78.6)  71     115 (61.9)    251    240 (-4.3)
16    10749  18123 (68.6)  277    456 (64.6)    985    925 (-6.0)
32    11133  17351 (55.8)  1132   1947 (71.9)   3935   3831 (-2.6)
64    11223  17115 (52.4)  4682   7836 (67.3)   15949  15373 (-3.6)
128   11269  16986 (50.7)  19783  31505 (59.2)  66799  61759 (-7.5)
___________________________________________________________
Summary: BW: 37.6  SD: 61.2  RSD: -6.5

___________________________________________________________
TCP: Local Host -> Guest (TCP_MAERTS)
#     BW1    BW2 (%)       SD1    SD2 (%)       RSD1   RSD2 (%)
___________________________________________________________
1     11490  10870 (-5.3)  0      0 (0)         2      2 (0)
2     10612  10554 (-.5)   2      3 (50.0)      12     12 (0)
4     10047  14320 (42.5)  13     16 (23.0)     53     53 (0)
8     9273   15182 (63.7)  56     84 (50.0)     228    233 (2.1)
16    9291   15853 (70.6)  235    390 (65.9)    934    965 (3.3)
32    9382   15741 (67.7)  969    1823 (88.1)   3868   4037 (4.3)
64    9270   14585 (57.3)  3966   8836 (122.7)  15415  17818 (15.5)
128   8997   14678 (63.1)  17024  36649 (115.2) 64933  72677 (11.9)
___________________________________________________________
SUM: BW: 24.8  SD: 114.6  RSD: 12.1

___________________________________________________________
UDP: Local Host -> Guest (UDP_STREAM)
#     BW1    BW2 (%)       SD1    SD2 (%)
___________________________________________________________
1     17236  16585 (-3.7)  1      1 (0)
2     16795  22693 (35.1)  5      6 (20.0)
4     13390  21502 (60.5)  37     36 (-2.7)
8     13261  24361 (83.7)  163    175 (7.3)
16    12772  23796 (86.3)  692    826 (19.3)
32    12832  23880 (86.0)  2812   2871 (2.0)
64    12779  24293 (90.1)  11299  11237 (-.5)
128   13006  24857 (91.1)  44778  43884 (-1.9)
___________________________________________________________
Summary: BW: 37.1  SD: -1.2
Re: [PATCH 2/3] [RFC] Changes for MQ virtio-net
On Mon, Feb 28, 2011 at 12:04:37PM +0530, Krishna Kumar wrote: Implement mq virtio-net driver. Though struct virtio_net_config changes, it works with the old qemu since the last element is not accessed unless qemu sets VIRTIO_NET_F_NUMTXQS. Patch also adds a macro for the maximum number of TX vq's (VIRTIO_MAX_TXQS) that the user can specify. Signed-off-by: Krishna Kumar krkum...@in.ibm.com Overall looks good. The numtxqs meaning the number of rx queues needs some cleanup. init/cleanup routines need more symmetry. Error handling on setup also seems slightly buggy or at least asymmetrical. Finally, this will use up a large number of MSI vectors, while TX interrupts mostly stay unused. Some comments below. --- drivers/net/virtio_net.c | 543 --- include/linux/virtio_net.h |6 2 files changed, 386 insertions(+), 163 deletions(-) diff -ruNp org/include/linux/virtio_net.h new/include/linux/virtio_net.h --- org/include/linux/virtio_net.h2010-10-11 10:20:22.0 +0530 +++ new/include/linux/virtio_net.h2011-02-25 16:24:15.0 +0530 @@ -7,6 +7,9 @@ #include linux/virtio_config.h #include linux/if_ether.h +/* Maximum number of individual RX/TX queues supported */ +#define VIRTIO_MAX_TXQS 16 + This also does not seem to belong in the header. /* The feature bitmap for virtio net */ #define VIRTIO_NET_F_CSUM0 /* Host handles pkts w/ partial csum */ #define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */ @@ -26,6 +29,7 @@ #define VIRTIO_NET_F_CTRL_RX 18 /* Control channel RX mode support */ #define VIRTIO_NET_F_CTRL_VLAN 19 /* Control channel VLAN filtering */ #define VIRTIO_NET_F_CTRL_RX_EXTRA 20/* Extra RX mode control support */ +#define VIRTIO_NET_F_NUMTXQS 21 /* Device supports multiple TX queue */ VIRTIO_NET_F_MULTIQUEUE ? 
#define VIRTIO_NET_S_LINK_UP 1 /* Link is up */ @@ -34,6 +38,8 @@ struct virtio_net_config { __u8 mac[6]; /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */ __u16 status; + /* number of RX/TX queues */ + __u16 numtxqs; The interface here is a bit ugly: - this is really both # of tx and rx queues but called numtxqs - there's a hardcoded max value - 0 is assumed to be same as 1 - assumptions above are undocumented. One way to address this could be num_queue_pairs, and something like /* The actual number of TX and RX queues is num_queue_pairs + 1 each. */ __u16 num_queue_pairs; (and tweak code to match). Alternatively, have separate registers for the number of tx and rx queues. } __attribute__((packed)); /* This is the first element of the scatter-gather list. If you don't diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c --- org/drivers/net/virtio_net.c 2011-02-21 17:55:42.0 +0530 +++ new/drivers/net/virtio_net.c 2011-02-25 16:23:41.0 +0530 @@ -40,31 +40,53 @@ module_param(gso, bool, 0444); #define VIRTNET_SEND_COMMAND_SG_MAX2 -struct virtnet_info { - struct virtio_device *vdev; - struct virtqueue *rvq, *svq, *cvq; - struct net_device *dev; +/* Internal representation of a send virtqueue */ +struct send_queue { + struct virtqueue *svq; + + /* TX: fragments + linear part + virtio header */ + struct scatterlist tx_sg[MAX_SKB_FRAGS + 2]; +}; + +/* Internal representation of a receive virtqueue */ +struct receive_queue { + /* Virtqueue associated with this receive_queue */ + struct virtqueue *rvq; + + /* Back pointer to the virtnet_info */ + struct virtnet_info *vi; + struct napi_struct napi; - unsigned int status; /* Number of input buffers, and max we've ever had. */ unsigned int num, max; - /* I like... big packets and I cannot lie! */ - bool big_packets; - - /* Host will merge rx buffers for big packets (shake it! shake it!) */ - bool mergeable_rx_bufs; - /* Work struct for refilling if we run low on memory. 
*/ struct delayed_work refill; /* Chain pages by the private ptr. */ struct page *pages; - /* fragments + linear part + virtio header */ + /* RX: fragments + linear part + virtio header */ struct scatterlist rx_sg[MAX_SKB_FRAGS + 2]; - struct scatterlist tx_sg[MAX_SKB_FRAGS + 2]; +}; + +struct virtnet_info { + struct send_queue **sq; + struct receive_queue **rq; + + /* read-mostly variables */ + int numtxqs ____cacheline_aligned_in_smp; /* # of rxqs/txqs */ Why do you think this alignment is a win? + struct virtio_device *vdev; + struct virtqueue *cvq; + struct net_device *dev; + unsigned int status; + + /* I like... big packets and I cannot lie! */ + bool big_packets; + + /* Host will merge rx buffers for big packets (shake
Re: Bug in kvm_set_irq
On Mon, Feb 28, 2011 at 09:56:46AM +0100, Jean-Philippe Menil wrote: On 27/02/2011 18:00, Michael S. Tsirkin wrote: On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote: Hi, Each time I try to use vhost_net, I'm facing a kernel bug. I do a modprobe vhost_net, and start the guest with vhost=on. Following is a trace with kernel 2.6.37, but I had the same problem with 2.6.36 (cf https://lkml.org/lkml/2010/11/30/29). 2.6.36 had a theoretical race that could explain this, but it should be ok in 2.6.37. The bug only occurs with vhost_net loaded, so I don't know if this is a bug in the kvm module code or in the vhost_net code. It could be a bug in eventfd, which is the interface used by both kvm and vhost_net. Just for fun, you can try 2.6.38 - eventfd code has been changed a lot in 2.6.38, and if it does not trigger there it's a hint that irqfd is the reason. Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243100] BUG: unable to handle kernel paging request at 2458 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243250] IP: [a041aa8a] kvm_set_irq+0x2a/0x130 [kvm] Could you run markup_oops/ksymoops on this please? As far as I can see kvm_set_irq can only get a wrong kvm pointer. Unless there's some general memory corruption, I'd guess. You can also try comparing the irqfd->kvm pointer in kvm_irqfd_assign, irqfd_wakeup and kvm_set_irq in virt/kvm/eventfd.c.
Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243378] PGD 45d363067 PUD 45e77a067 PMD 0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243556] Oops: [#1] SMP Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243692] last sysfs file: /sys/devices/pci:00/:00:0d.0/:05:00.0/:06:00.0/irq Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243777] CPU 0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243820] Modules linked in: vhost_net macvtap macvlan tun powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand fre q_table cpufreq_conservative fuse xt_physdev ip6t_LOG ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp nf_conntrack_ftp nf_connt rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore snd_page_alloc tpm_tis tpm ps mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses sd_mod enclosu re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2 scsi_mod ehci_hcd [last unloaded: scsi_wait_scan] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Pid: 10, comm: kworker/0:1 Not tainted 2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RIP: 0010:[a041aa8a] [a041aa8a] kvm_set_irq+0x2a/0x130 [kvm] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RSP: 0018:88045fc89d30 EFLAGS: 00010246 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RAX: RBX: 001a RCX: 0001 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RDX: RSI: RDI: Feb 23 13:56:19 
ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RBP: R08: 0001 R09: 880856a91e48 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] R10: R11: R12: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] R13: 0001 R14: R15: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] FS: 7f617986c710() GS:88007f80() knlGS: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] CS: 0010 DS: ES: CR0: 8005003b Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] CR2: 2458 CR3: 00045d197000 CR4: 06f0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] DR0: DR1: DR2: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] DR3: DR6: 0ff0 DR7: 0400 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Process kworker/0:1 (pid: 10, threadinfo 88045fc88000, task 88085fc53c30) Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Stack: Feb 23 13:56:19 ayrshire.u06.univ
Re: [PATCH 3/4] KVM: Emulate MSI-X table in kernel
On Mon, Feb 28, 2011 at 03:20:04PM +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Signed-off-by: Sheng Yang sh...@linux.intel.com A general question: we implement mmio read and write operations here, but seem to do nothing about ordering. In particular, pci mmio reads are normally assumed to flush all writes out to the device and all device writes in to the CPU. How exactly does it work here? --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/Makefile | 2 +- arch/x86/kvm/mmu.c | 2 + arch/x86/kvm/x86.c | 40 - include/linux/kvm.h | 28 include/linux/kvm_host.h | 36 + virt/kvm/assigned-dev.c | 44 ++ virt/kvm/kvm_main.c | 38 +- virt/kvm/msix_mmio.c | 301 +++ virt/kvm/msix_mmio.h | 25 10 files changed, 504 insertions(+), 13 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index aa75f21..4a390a4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -635,6 +635,7 @@ enum emulation_result { EMULATE_DONE, /* no further processing */ EMULATE_DO_MMIO, /* kvm_run filled with mmio request */ EMULATE_FAIL, /* can't emulate this instruction */ + EMULATE_USERSPACE_EXIT, /* we need exit to userspace */ }; #define EMULTYPE_NO_DECODE (1 << 0) diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I.
kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9cafbb4..912dca4 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3358,6 +3358,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code, case EMULATE_DO_MMIO: ++vcpu->stat.mmio_exits; /* fall through */ + case EMULATE_USERSPACE_EXIT: + /* fall through */ case EMULATE_FAIL: return 0; default: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 21b84e2..87308eb 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3809,6 +3810,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, { gpa_t gpa; struct kvm_io_ext_data ext_data; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3824,18 +3826,32 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val, &ext_data); /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val, &ext_data)) + if (!r) return X86EMUL_CONTINUE; - vcpu->mmio_needed = 1; - vcpu->run->exit_reason = KVM_EXIT_MMIO; - vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa; - vcpu->run->mmio.len = vcpu->mmio_size = bytes; - vcpu->run->mmio.is_write = vcpu->mmio_is_write = 1; - memcpy(vcpu->run->mmio.data, val, bytes); + if (r == -ENOTSYNC) { Can you replace -ENOTSYNC with KVM_EXIT_MSIX_ROUTING_UPDATE all over please?
+ vcpu->userspace_exit_needed = 1; + vcpu->run->exit_reason = KVM_EXIT_MSIX_ROUTING_UPDATE; + vcpu->run->msix_routing.dev_id = + ext_data.msix_routing.dev_id; + vcpu->run->msix_routing.type = + ext_data.msix_routing.type; + vcpu->run->msix_routing.entry_idx = + ext_data.msix_routing.entry_idx; + vcpu->run->msix_routing.flags = + ext_data.msix_routing.flags; + } else { + vcpu->mmio_needed = 1; + vcpu->run->exit_reason = KVM_EXIT_MMIO; + vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa; + vcpu->run->mmio.len = vcpu->mmio_size = bytes; + vcpu->run->mmio.is_write = vcpu->mmio_is_write = 1; +