Re: [PATCH] slow_map: minor improvements to ROM BAR handling
On Tue, Dec 22, 2009 at 02:34:42PM +0100, Alexander Graf wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Dec 22, 2009 at 01:05:23PM +0100, Alexander Graf wrote:
> > > Michael S. Tsirkin wrote:
> > > > ROM BAR can be handled same as regular BAR: load_option_roms
> > > > utility will take care of copying it to RAM as appropriate.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > > > ---
> > > >
> > > > This patch applies on top of agraf's one, it takes care of
> > > > non-page aligned ROM BARs as well: they mostly are taken care of,
> > > > we just do not need to warn user about them.
> > > >
> > > >  hw/device-assignment.c |   20 ++++++++++----------
> > > >  1 files changed, 9 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> > > > index 000fa61..066fdb6 100644
> > > > --- a/hw/device-assignment.c
> > > > +++ b/hw/device-assignment.c
> > > > @@ -486,25 +486,23 @@ static int assigned_dev_register_regions(PCIRegion *io_regions,
> > > >              : PCI_BASE_ADDRESS_SPACE_MEMORY;
> > > >
> > > >          if (cur_region->size & 0xFFF) {
> > > > -            fprintf(stderr, "PCI region %d at address 0x%llx "
> > > > -                    "has size 0x%x, which is not a multiple of 4K. "
> > > > -                    "You might experience some performance hit due to that.\n",
> > > > -                    i, (unsigned long long)cur_region->base_addr,
> > > > -                    cur_region->size);
> > > > +            if (i != PCI_ROM_SLOT) {
> > > > +                fprintf(stderr, "PCI region %d at address 0x%llx "
> > > > +                        "has size 0x%x, which is not a multiple of 4K. "
> > > > +                        "You might experience some performance hit "
> > > > +                        "due to that.\n",
> > > > +                        i, (unsigned long long)cur_region->base_addr,
> > > > +                        cur_region->size);
> > > > +            }
> > > >              slow_map = 1;
> > >
> > > This is wrong. You're setting slow_map = 1 on code that is very
> > > likely to be executed inside the guest. That doesn't work.
> >
> > It is? Can you really run code directly from a PCI card? I looked at
> > the BIOS boot specification and it always talks about shadowing PCI
> > ROMs.
>
> I'm not sure the BIOS is the only one executing ROMs. If it is, then
> I'm good with the change. Maybe it'd make sense to also add a
> read-only flag so we don't accidentally try to write to the ROM region
> with slow_map.
>
> Alex

Correct: I think it's made readonly down the road with mprotect, so an
attempt to do so will crash qemu :)

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
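The read-only flag Alex asks for is what the later patch versions implement by registering the ROM slot with a NULL write callback. A standalone C sketch of that dispatch idea (illustrative only — `bar_region`, `bar_read`, and friends are made-up names, not qemu's `cpu_register_io_memory()` API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal model of a slow-mapped BAR: accesses go through callbacks,
 * and a NULL write callback makes the region read-only -- guest writes
 * are silently dropped instead of reaching the backing store. */
typedef uint32_t (*bar_read_fn)(void *opaque, uint64_t addr);
typedef void (*bar_write_fn)(void *opaque, uint64_t addr, uint32_t val);

struct bar_region {
    void *opaque;
    bar_read_fn read;
    bar_write_fn write;   /* NULL => read-only, e.g. a ROM BAR */
};

static uint32_t bar_read(struct bar_region *r, uint64_t addr)
{
    return r->read(r->opaque, addr);
}

static void bar_write(struct bar_region *r, uint64_t addr, uint32_t val)
{
    if (r->write)                       /* ROM BAR: ignore the write */
        r->write(r->opaque, addr, val);
}

/* Trivial backing-store callbacks for the demo. */
static uint32_t mem_read(void *opaque, uint64_t addr)
{
    return ((uint32_t *)opaque)[addr / 4];
}

static void mem_write(void *opaque, uint64_t addr, uint32_t val)
{
    ((uint32_t *)opaque)[addr / 4] = val;
}
```

Registering the same backing store with `write = NULL` gives ROM semantics; with `mem_write` it behaves as a normal slow-mapped BAR.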
Re: [PATCH] slow_map: minor improvements to ROM BAR handling
On Tue, Dec 22, 2009 at 05:00:52PM +0100, Alexander Graf wrote:
> Avi Kivity wrote:
> > On 12/22/2009 05:41 PM, Alexander Graf wrote:
> > > > We could certainly extend emulate.c to fetch instruction bytes
> > > > from userspace. It uses ->read_std() now, so we'd need to switch
> > > > to ->read_emulated() and add appropriate buffering.
> > >
> > > I thought the policy on emulate.c was to not have a full
> > > instruction emulator but only emulate instructions that do PT
> > > modifications or MMIO access?
> >
> > It's not a policy, just laziness. With emulate_invalid_guest_state=1
> > we need many more instructions. Of course I don't want to add
> > instructions just for the sake of it, since they will be untested.
> >
> > I'd much prefer not to run from mmio if possible - just pointing out
> > it's doable.
>
> Right... emulator is _really_ small. It only does a few MMU specific
> instructions, a couple of privileged ones and MMIO accessing ones.
>
> > > Btw, we're in the same situation with PowerPC here. The instruction
> >
> > Plus, you have a fixed length instruction length, likely more
> > regular too. I imagine powerpc is load/store, so you don't have to
> > emulate a zillion ALU instructions?
>
> Well, it's certainly doable (and easier than on x86). But I'm in the
> same position as you on the x86 side. Why increase the emulator size
> at least 10 times if we don't have to?
>
> Either way, people will report bugs when / if they actually start
> executing code off MMIO. So let's not care too much about it for now.
> Just make sure the read-only check is in.
>
> Alex

So I think all we need is this on top?

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 066fdb6..0c3c8f4 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -233,7 +233,8 @@ static void assigned_dev_iomem_map_slow(PCIDevice *pci_dev, int region_num,
     int m;
 
     DEBUG("slow map\n");
-    m = cpu_register_io_memory(slow_bar_read, slow_bar_write, region);
+    m = cpu_register_io_memory(slow_bar_read, region_num == PCI_ROM_SLOT ?
+                               NULL : slow_bar_write, region);
     cpu_register_physical_memory(e_phys, e_size, m);
 
     /* MSI-X MMIO page */
[PATCHv2] slow_map: minor improvements to ROM BAR handling
ROM BAR can be handled same as regular BAR: load_option_roms utility
will take care of copying it to RAM as appropriate.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Changes from v1: made ROM BAR read-only.

 hw/device-assignment.c |   23 ++++++++++++-----------
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 000fa61..0c3c8f4 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -233,7 +233,8 @@ static void assigned_dev_iomem_map_slow(PCIDevice *pci_dev, int region_num,
     int m;
 
     DEBUG("slow map\n");
-    m = cpu_register_io_memory(slow_bar_read, slow_bar_write, region);
+    m = cpu_register_io_memory(slow_bar_read, region_num == PCI_ROM_SLOT ?
+                               NULL : slow_bar_write, region);
     cpu_register_physical_memory(e_phys, e_size, m);
 
     /* MSI-X MMIO page */
@@ -486,25 +487,23 @@ static int assigned_dev_register_regions(PCIRegion *io_regions,
             : PCI_BASE_ADDRESS_SPACE_MEMORY;
 
         if (cur_region->size & 0xFFF) {
-            fprintf(stderr, "PCI region %d at address 0x%llx "
-                    "has size 0x%x, which is not a multiple of 4K. "
-                    "You might experience some performance hit due to that.\n",
-                    i, (unsigned long long)cur_region->base_addr,
-                    cur_region->size);
+            if (i != PCI_ROM_SLOT) {
+                fprintf(stderr, "PCI region %d at address 0x%llx "
+                        "has size 0x%x, which is not a multiple of 4K. "
+                        "You might experience some performance hit "
+                        "due to that.\n",
+                        i, (unsigned long long)cur_region->base_addr,
+                        cur_region->size);
+            }
             slow_map = 1;
         }
 
-        if (slow_map && (i == PCI_ROM_SLOT)) {
-            fprintf(stderr, "ROM not aligned - can't continue\n");
-            return -1;
-        }
-
         /* map physical memory */
         pci_dev->v_addrs[i].e_physbase = cur_region->base_addr;
         if (i == PCI_ROM_SLOT) {
             pci_dev->v_addrs[i].u.r_virtbase =
                 mmap(NULL,
-                     (cur_region->size + 0xFFF) & 0xFFFFF000,
+                     cur_region->size,
                      PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE,
                      0, (off_t) 0);

-- 
1.6.6.rc1.43.gf55cc
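The alignment test the patch keeps (`size & 0xFFF`) and the round-up arithmetic it removes from the ROM mmap path can be sketched in plain C (a standalone illustration, not qemu code):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000u
#define PAGE_MASK (PAGE_SIZE - 1)   /* 0xFFF */

/* Non-zero when size is not a multiple of 4K -- the condition guarding
 * the performance warning (and forcing slow_map). */
static int is_unaligned(uint32_t size)
{
    return (size & PAGE_MASK) != 0;
}

/* Round a BAR size up to the next 4K boundary -- the arithmetic the
 * pre-patch code applied to the anonymous mmap length for the ROM. */
static uint32_t round_up_4k(uint32_t size)
{
    return (size + PAGE_MASK) & ~PAGE_MASK;
}
```

The round-up never shrinks a size and is idempotent on already-aligned sizes, which is why dropping it in favour of the raw size is safe once the ROM is copied to RAM by load_option_roms.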
Re: [PATCH] slow_map: minor improvements to ROM BAR handling
On Tue, Dec 22, 2009 at 05:39:22PM +0200, Avi Kivity wrote:
> On 12/22/2009 05:36 PM, Alexander Graf wrote:
> > > Is there a way to trap this and fprintf something?
> >
> > I don't think so. KVM will just trap on execution outside of RAM and
> > either fail badly or throw something bad into the guest. MMIO access
> > works by analyzing the instruction that accesses the MMIO address.
> > That just doesn't work when we don't have an instruction to analyze.
>
> We could certainly extend emulate.c to fetch instruction bytes from
> userspace. It uses ->read_std() now, so we'd need to switch to
> ->read_emulated() and add appropriate buffering.

You mean run with KVM, and TCG will kick in when there's an instruction
we can't support natively?

-- 
error compiling committee.c: too many arguments to function
[PATCHv3] slow_map: minor improvements to ROM BAR handling
ROM BAR can be handled same as regular BAR: load_option_roms utility
will take care of copying it to RAM as appropriate.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Changes from v2: replace ?: with an if statement
Changes from v1: make ROM bar read only

 hw/device-assignment.c |   25 +++++++++++++------------
 1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 000fa61..9197dca 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -233,7 +233,10 @@ static void assigned_dev_iomem_map_slow(PCIDevice *pci_dev, int region_num,
     int m;
 
     DEBUG("slow map\n");
-    m = cpu_register_io_memory(slow_bar_read, slow_bar_write, region);
+    if (region_num == PCI_ROM_SLOT)
+        m = cpu_register_io_memory(slow_bar_read, NULL, region);
+    else
+        m = cpu_register_io_memory(slow_bar_read, slow_bar_write, region);
     cpu_register_physical_memory(e_phys, e_size, m);
 
     /* MSI-X MMIO page */
@@ -486,25 +489,23 @@ static int assigned_dev_register_regions(PCIRegion *io_regions,
             : PCI_BASE_ADDRESS_SPACE_MEMORY;
 
         if (cur_region->size & 0xFFF) {
-            fprintf(stderr, "PCI region %d at address 0x%llx "
-                    "has size 0x%x, which is not a multiple of 4K. "
-                    "You might experience some performance hit due to that.\n",
-                    i, (unsigned long long)cur_region->base_addr,
-                    cur_region->size);
+            if (i != PCI_ROM_SLOT) {
+                fprintf(stderr, "PCI region %d at address 0x%llx "
+                        "has size 0x%x, which is not a multiple of 4K. "
+                        "You might experience some performance hit "
+                        "due to that.\n",
+                        i, (unsigned long long)cur_region->base_addr,
+                        cur_region->size);
+            }
             slow_map = 1;
         }
 
-        if (slow_map && (i == PCI_ROM_SLOT)) {
-            fprintf(stderr, "ROM not aligned - can't continue\n");
-            return -1;
-        }
-
         /* map physical memory */
         pci_dev->v_addrs[i].e_physbase = cur_region->base_addr;
         if (i == PCI_ROM_SLOT) {
             pci_dev->v_addrs[i].u.r_virtbase =
                 mmap(NULL,
-                     (cur_region->size + 0xFFF) & 0xFFFFF000,
+                     cur_region->size,
                      PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE,
                      0, (off_t) 0);

-- 
1.6.6.rc1.43.gf55cc
[PATCH] vhost: access check thinko fixes
This fixes two issues with the recent access checking patch:
1. if (d->vqs[i].private_data) -> if (d->vqs[i].private_data)
2. we can't forbid log base changes while the ring is running, because
the host needs to resize the log in rare cases (e.g. when memory is
added with a balloon), and in that case it needs to increase the log
size with realloc, which might move the log address.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Rusty, this needs to be applied on top of the access_ok patch. If you
want me to roll it in with that one and resend, let me know please.
Thanks!

 drivers/vhost/vhost.c |   39 ++++++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2b65d9b..c8c25db 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -230,7 +230,7 @@ static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 }
 
 /* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(struct vhost_virtqueue *vq, struct vhost_memory *mem,
+static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
 			       int log_all)
 {
 	int i;
@@ -242,7 +242,7 @@ static int vq_memory_access_ok(struct vhost_virtqueue *vq, struct vhost_memory *
 		else if (!access_ok(VERIFY_WRITE, (void __user *)a,
 				    m->memory_size))
 			return 0;
-		else if (log_all && !log_access_ok(vq->log_base,
+		else if (log_all && !log_access_ok(log_base,
 						   m->guest_phys_addr,
 						   m->memory_size))
 			return 0;
@@ -259,9 +259,10 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
 	for (i = 0; i < d->nvqs; ++i) {
 		int ok;
 		mutex_lock(&d->vqs[i].mutex);
-		/* If ring is not running, will check when it's enabled. */
-		if (d->vqs[i].private_data)
-			ok = vq_memory_access_ok(d->vqs[i], mem, log_all);
+		/* If ring is inactive, will check when it's enabled. */
+		if (d->vqs[i].private_data)
+			ok = vq_memory_access_ok(d->vqs[i].log_base, mem,
+						 log_all);
 		else
 			ok = 1;
 		mutex_unlock(&d->vqs[i].mutex);
@@ -290,18 +291,25 @@ int vhost_log_access_ok(struct vhost_dev *dev)
 	return memory_access_ok(dev, dev->memory, 1);
 }
 
-/* Can we start vq? */
+/* Verify access for write logging. */
 /* Caller should have vq mutex and device mutex */
-int vhost_vq_access_ok(struct vhost_virtqueue *vq)
+static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
 {
-	return vq_access_ok(vq->num, vq->desc, vq->avail, vq->used) &&
-		vq_memory_access_ok(vq, vq->dev->memory,
+	return vq_memory_access_ok(log_base, vq->dev->memory,
 			    vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
-		(!vq->log_used || log_access_ok(vq->log_base, vq->log_addr,
+		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
 					sizeof *vq->used +
 					vq->num * sizeof *vq->used->ring));
 }
 
+/* Can we start vq? */
+/* Caller should have vq mutex and device mutex */
+int vhost_vq_access_ok(struct vhost_virtqueue *vq)
+{
+	return vq_access_ok(vq->num, vq->desc, vq->avail, vq->used) &&
+		vq_log_access_ok(vq, vq->log_base);
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
 	struct vhost_memory mem, *newmem, *oldmem;
@@ -564,15 +572,14 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
 		}
 		for (i = 0; i < d->nvqs; ++i) {
 			struct vhost_virtqueue *vq;
+			void __user *base = (void __user *)(unsigned long)p;
 			vq = d->vqs + i;
 			mutex_lock(&vq->mutex);
-			/* Moving log base with an active backend?
-			 * You don't want to do that. */
-			if (vq->private_data && (vq->log_used ||
-			    vhost_has_feature(d, VHOST_F_LOG_ALL)))
-				r = -EBUSY;
+			/* If ring is inactive, will check when it's enabled. */
+			if (vq->private_data && !vq_log_access_ok(vq, base))
+				r = -EFAULT;
 			else
-				vq->log_base = (void __user *)(unsigned long)p;
+				vq->log_base = base;
 			mutex_unlock(&vq->mutex);
 		}
 		break;

-- 
1.6.6.rc1.43.gf55cc
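The commit message's point about the balloon comes down to bitmap arithmetic: the dirty log holds one bit per guest page, so covering more guest memory needs a bigger log, and realloc may move its base. A standalone sketch of that sizing math (illustrative names, assuming 4K logging granularity; not the kernel's actual `log_access_ok()` code):

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_PAGE_SHIFT 12   /* assumed 4K logging granularity */

/* Bytes of dirty-log bitmap needed to cover guest-physical range
 * [0, addr + sz): one bit per page, rounded up to whole bytes. */
static uint64_t log_bytes_needed(uint64_t addr, uint64_t sz)
{
    uint64_t last_page = (addr + sz - 1) >> VHOST_PAGE_SHIFT;
    return last_page / 8 + 1;
}
```

When the balloon returns memory and the covered range grows past a byte boundary, `log_bytes_needed()` increases, so the host must realloc the log -- which is exactly why forbidding log base changes while the ring runs would be wrong.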
Re: [PATCH] vhost-net: defer f-private_data until setup succeeds
On Tue, Dec 22, 2009 at 11:02:28AM -0800, Chris Wright wrote:
> Trivial change, just for readability. The filp is not installed on
> failure, so the current code is not incorrect (also vhost_dev_init
> currently has no failure case). This just treats setting
> f->private_data as something with global scope (sure, true only after
> fd_install).
>
> Signed-off-by: Chris Wright <chr...@redhat.com>

Acked-by: Michael S. Tsirkin <m...@redhat.com>

> ---
>  drivers/vhost/net.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 22d5fef..0697ab2 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -326,7 +326,6 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	int r;
>  	if (!n)
>  		return -ENOMEM;
> -	f->private_data = n;
>  	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
>  	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
>  	r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
> @@ -338,6 +337,9 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
>  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> +
> +	f->private_data = n;
> +
>  	return 0;
>  }
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On Wed, Dec 23, 2009 at 11:28:08AM -0800, Ira W. Snyder wrote:
> On Wed, Dec 23, 2009 at 12:34:44PM -0500, Gregory Haskins wrote:
> > On 12/23/09 1:15 AM, Kyle Moffett wrote:
> > > On Tue, Dec 22, 2009 at 12:36, Gregory Haskins
> > > <gregory.hask...@gmail.com> wrote:
> > > > On 12/22/09 2:57 AM, Ingo Molnar wrote:
> > > > > * Gregory Haskins <gregory.hask...@gmail.com> wrote:
> > > > > > Actually, these patches have nothing to do with the KVM
> > > > > > folks. [...]
> > > > >
> > > > > That claim is curious to me - the AlacrityVM host
> > > >
> > > > It's quite simple, really. These drivers support accessing vbus,
> > > > and vbus is hypervisor agnostic. In fact, vbus isn't necessarily
> > > > even hypervisor related. It may be used anywhere where a Linux
> > > > kernel is the io backend, which includes hypervisors like
> > > > AlacrityVM, but also userspace apps, and interconnected physical
> > > > systems as well.

So a focus on interconnecting physical systems would, I think, be one
way for vbus to stop conflicting with KVM. If drivers for such systems
appear, I expect that the relevant (hypervisor-agnostic) vbus bits
would be very uncontroversial. This would not be the first technology
to make the jump from attempting to be a PCI replacement to being an
interconnect, btw - I think infiniband did this as well.

-- 
MST
Re: [PATCH] vhost: rcu annotation fixup
On Tue, Jan 18, 2011 at 09:48:34AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> > When built with rcu checks enabled, vhost triggers bogus warnings as
> > vhost features are read without dev->mutex sometimes. Fixing it
> > properly is not trivial as vhost.h does not know which lockdep
> > classes it will be used under. Disable the warning by stubbing out
> > the check for now.
> >
> > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > ---
> >  drivers/vhost/vhost.h |    4 +---
> >  1 files changed, 1 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 2af44b7..2d03a31 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> >  {
> >  	unsigned acked_features;
> >
> > -	acked_features =
> > -		rcu_dereference_index_check(dev->acked_features,
> > -					    lockdep_is_held(&dev->mutex));
> > +	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
>
> Ouch!!! Could you please at least add a comment?

Yes, OK.

> Alternatively, pass in the lock that is held and check for that?
> Given that this is a static inline, the compiler should be able to
> optimize the argument away when !PROVE_RCU, correct?
>
> 							Thanx, Paul

Hopefully, yes. We don't always have a lock: the idea was to create a
lockdep for these cases. But we can't pass the pointer to that ...

> >  	return acked_features & (1 << bit);
> >  }
> >
> > -- 
> > 1.7.3.2.91.g446ac
[PATCHv2] vhost: rcu annotation fixup
When built with rcu checks enabled, vhost triggers bogus warnings as
vhost features are read without dev->mutex sometimes, and the private
pointer is read with our kind of rcu where work serves as a read side
critical section. Fixing it properly is not trivial. Disable the
warnings by stubbing out the checks for now.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Changes from v1: add TODO, fix more warnings.

 drivers/vhost/net.c   |    9 +++++----
 drivers/vhost/vhost.h |    6 +++---
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..f616cef 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -128,8 +128,7 @@ static void handle_tx(struct vhost_net *net)
 	size_t hdr_size;
 	struct socket *sock;
 
-	/* TODO: check that we are running from vhost_worker?
-	 * Not sure it's worth it, it's straight-forward enough. */
+	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
 	if (!sock)
 		return;
@@ -306,7 +305,8 @@ static void handle_rx_big(struct vhost_net *net)
 	size_t len, total_len = 0;
 	int err;
 	size_t hdr_size;
-	struct socket *sock = rcu_dereference(vq->private_data);
+	/* TODO: check that we are running from vhost_worker? */
+	struct socket *sock = rcu_dereference_check(vq->private_data, 1);
 
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 		return;
@@ -415,7 +415,8 @@ static void handle_rx_mergeable(struct vhost_net *net)
 	int err, headcount;
 	size_t vhost_hlen, sock_hlen;
 	size_t vhost_len, sock_len;
-	struct socket *sock = rcu_dereference(vq->private_data);
+	/* TODO: check that we are running from vhost_worker? */
+	struct socket *sock = rcu_dereference_check(vq->private_data, 1);
 
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 		return;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 2af44b7..b3363ae 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -173,9 +173,9 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
 {
 	unsigned acked_features;
 
-	acked_features =
-		rcu_dereference_index_check(dev->acked_features,
-					    lockdep_is_held(&dev->mutex));
+	/* TODO: check that we are running from vhost_worker or dev mutex is
+	 * held? */
+	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
 	return acked_features & (1 << bit);
 }

-- 
1.7.3.2.91.g446ac
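The effect of the patch — passing a constant-true condition to the `_check` variants — can be modeled in plain userspace C. This is only a model of the checking side (names like `model_rcu_deref_check` are invented; the real kernel macro also carries RCU semantics and calls `lockdep_rcu_dereference()` to emit the splat):

```c
#include <assert.h>

/* Count of would-be lockdep splats, for demonstration. */
static int warnings;

/* Model of rcu_dereference_check()'s debug side: 'safe' is a condition
 * under which the access is known to be protected (e.g. "mutex held").
 * When the condition is false, a warning fires. Passing a constant 1
 * -- as the vhost patch does -- stubs the check out entirely. */
static int model_rcu_deref_check(int value, int safe)
{
    if (!safe)
        warnings++;   /* models the lockdep_rcu_dereference() splat */
    return value;
}
```

This makes Mel's objection concrete: with the condition stubbed to 1 the macro still documents "dereference happens here", but no longer documents *which* lock protects it.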
Re: [PATCH] vhost: rcu annotation fixup
On Tue, Jan 18, 2011 at 11:02:33AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 18, 2011 at 07:55:00PM +0200, Michael S. Tsirkin wrote:
> > On Tue, Jan 18, 2011 at 09:48:34AM -0800, Paul E. McKenney wrote:
> > > On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> > > > When built with rcu checks enabled, vhost triggers bogus
> > > > warnings as vhost features are read without dev->mutex
> > > > sometimes. Fixing it properly is not trivial as vhost.h does not
> > > > know which lockdep classes it will be used under. Disable the
> > > > warning by stubbing out the check for now.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > > > ---
> > > >  drivers/vhost/vhost.h |    4 +---
> > > >  1 files changed, 1 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > > index 2af44b7..2d03a31 100644
> > > > --- a/drivers/vhost/vhost.h
> > > > +++ b/drivers/vhost/vhost.h
> > > > @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> > > >  {
> > > >  	unsigned acked_features;
> > > >
> > > > -	acked_features =
> > > > -		rcu_dereference_index_check(dev->acked_features,
> > > > -					    lockdep_is_held(&dev->mutex));
> > > > +	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
> > >
> > > Ouch!!! Could you please at least add a comment?
> >
> > Yes, OK.
> >
> > > Alternatively, pass in the lock that is held and check for that?
> > > Given that this is a static inline, the compiler should be able
> > > to optimize the argument away when !PROVE_RCU, correct?
> > >
> > > 							Thanx, Paul
> >
> > Hopefully, yes. We don't always have a lock: the idea was to create
> > a lockdep for these cases. But we can't pass the pointer to that ...
>
> I suppose you could pass a pointer to the lockdep map structure. Not
> sure if this makes sense, but it would handle the situation.

Will it compile with lockdep disabled too? What will the pointer be?

> Alternatively, create a helper function that checks the possibilities
> and screams if none of them are in effect.
>
> 							Thanx, Paul

The problem here is that the callee needs to know about all callers.

> > > >  	return acked_features & (1 << bit);
> > > >  }
> > > >
> > > > -- 
> > > > 1.7.3.2.91.g446ac
Re: Flow Control and Port Mirroring Revisited
On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > > So it won't be all that simple to implement well, and before we
> > > try, I'd like to know whether there are applications that are
> > > helped by it. For example, we could try to measure latency at
> > > various pps and see whether the backpressure helps. netperf has
> > > -b, -w flags which might help these measurements.
> >
> > Those options are enabled when one adds --enable-burst to the
> > pre-compilation ./configure of netperf (one doesn't have to
> > recompile netserver). However, if one is also looking at latency
> > statistics via the -j option in the top-of-trunk, or simply at the
> > histogram with --enable-histogram on the ./configure and a verbosity
> > level of 2 (global -v 2), then one wants the very top of trunk
> > netperf from:
> >
> > http://www.netperf.org/svn/netperf2/trunk
> >
> > to get the recently added support for accurate (netperf-level) RTT
> > measurements on burst-mode request/response tests.
> >
> > happy benchmarking,
> >
> > rick jones
> >
> > PS - the enhanced latency statistics from -j are only available in
> > the omni version of the TCP_RR test. To get that add a --enable-omni
> > to the ./configure - and in this case both netperf and netserver
> > have to be recompiled.
>
> Is this TCP only? I would love to get latency data from UDP as well.

For very basic output one can peruse the output of:

src/netperf -t omni -- -O /?

and then pick those outputs of interest and put them into an output
selection file, which one then passes to either the (test-specific) -o,
-O or -k option to get CSV, human-readable or keyval output
respectively. E.g.:

raj@tardy:~/netperf2_trunk$ cat foo
THROUGHPUT,THROUGHPUT_UNITS
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY

When foo is passed to -o one will get those all on one line of CSV.
To -O one gets three lines of more netperf-classic-like human-readable
output, and when one passes that to -k one gets a string of keyval
output a la:

raj@tardy:~/netperf2_trunk$ src/netperf -t omni -j -v 2 -- -r 1 -d rr -k foo
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram
THROUGHPUT=29454.12
THROUGHPUT_UNITS=Trans/s
RT_LATENCY=33.951
MIN_LATENCY=19
MEAN_LATENCY=32.00
MAX_LATENCY=126
P50_LATENCY=32
P90_LATENCY=38
P99_LATENCY=41
STDDEV_LATENCY=5.46

Histogram of request/response times
UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_USEC      :    0: 3553: 45244: 237790: 7859:   86:   10:    3:    0:    0
HUNDRED_USEC  :    0:    2:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_MSEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_MSEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
>100_SECS: 0
HIST_TOTAL:      294547
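The keyval (-k) format shown above is the easiest to consume from a benchmark harness. A small helper for pulling one numeric field out of such output (a scripting convenience written for this discussion, not part of netperf itself):

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Find KEY=value at the start of a line in netperf -k (keyval) output,
 * e.g. the THROUGHPUT=29454.12 line above, and return the value as a
 * double. Returns 'missing' if the key is absent. */
static double keyval_get(const char *output, const char *key, double missing)
{
    const char *p = output;
    size_t klen = strlen(key);

    while (p) {
        /* Require an exact key followed by '=', so that e.g. the key
         * "THROUGHPUT" does not match a "THROUGHPUT_UNITS=..." line. */
        if (strncmp(p, key, klen) == 0 && p[klen] == '=')
            return strtod(p + klen + 1, NULL);
        p = strchr(p, '\n');    /* advance to the next line, if any */
        if (p)
            p++;
    }
    return missing;
}
```

Feeding it the captured stdout of `src/netperf ... -k foo` gives direct access to THROUGHPUT, P99_LATENCY, and the rest without awk gymnastics.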
Re: [PATCH] vhost: rcu annotation fixup
On Wed, Jan 19, 2011 at 12:40:40AM +0000, Mel Gorman wrote:
> On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> > When built with rcu checks enabled, vhost triggers bogus warnings as
> > vhost features are read without dev->mutex sometimes. Fixing it
> > properly is not trivial as vhost.h does not know which lockdep
> > classes it will be used under. Disable the warning by stubbing out
> > the check for now.
>
> What is the harm in leaving the bogus warnings until the difficult
> fix happens? RCU checks enabled does not seem like something that is
> enabled in production.

I would like to run with rcu checks enabled sometimes to debug kvm,
which has an elaborate rcu strategy. Bogus warnings in the log make it
easy to overlook the real ones.

Further, the rcu macros used are a form of documentation. If we have

-	acked_features =
-		rcu_dereference_index_check(dev->acked_features,
-					    lockdep_is_held(&dev->mutex));

this means 'taken in rcu read side critical section or under mutex',
while

+	acked_features = rcu_dereference_index_check(dev->acked_features, 1);

means 'not checked'.

> If this patch is applied, there is always the risk that it'll be
> simply forgotten about.

Well, that's why I put in a TODO. If there's demand for it, I can add a
Kconfig option to trigger a warning at each unchecked rcu call in
vhost-net, but I doubt it'll get a lot of use :)

> -- 
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
[PATCH] vhost: force vhost off for non-MSI guests
When MSI is off, each interrupt needs to be bounced through the io
thread when it's set/cleared, so vhost-net causes more context switches
and higher CPU utilization than userspace virtio which handles
networking in the same thread.

We'll need to fix this by adding level irq support in kvm irqfd; for
now, disable vhost-net in these configurations.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

I need to report some error from virtio-pci that would be handled
specially (disable but don't report an error), so I wanted one that's
never likely to be used by a userspace ioctl. I selected ERANGE but
it'd be easy to switch to something else. Comments?

 hw/vhost.c      |    4 +++-
 hw/virtio-net.c |    6 ++++--
 hw/virtio-pci.c |    3 +++
 hw/virtio.h     |    2 ++
 4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1d09ed0..c79765a 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
     r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ccb3e63..5de3fee 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
         if (!n->vhost_started) {
             int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
             if (r < 0) {
-                error_report("unable to start vhost net: %d: "
-                             "falling back on userspace virtio", -r);
+                if (r != -EVIRTIO_DISABLED) {
+                    error_report("unable to start vhost net: %d: "
+                                 "falling back on userspace virtio", -r);
+                }
             } else {
                 n->vhost_started = 1;
             }

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index dd8887a..dbf4be0 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
     EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
 
     if (assign) {
+        if (!msix_enabled(&proxy->pci_dev)) {
+            return -EVIRTIO_DISABLED;
+        }
         int r = event_notifier_init(notifier, 0);
         if (r < 0) {
             return r;

diff --git a/hw/virtio.h b/hw/virtio.h
index d8546d5..53bbdba 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -98,6 +98,8 @@ typedef struct {
     void (*vmstate_change)(void * opaque, bool running);
 } VirtIOBindings;
 
+#define EVIRTIO_DISABLED ERANGE
+
 #define VIRTIO_PCI_QUEUE_MAX 64
 
 #define VIRTIO_NO_VECTOR 0xffff

-- 
1.7.3.2.91.g446ac
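The error convention the patch relies on — callees return negative errno values, and one sentinel value means "disable silently" while anything else is reportable — can be demonstrated standalone (the `set_guest_notifier`/`should_report` helpers below are illustrative stand-ins, not the qemu functions):

```c
#include <errno.h>

/* Alias one unlikely errno as the sentinel, as the patch does. */
#define EVIRTIO_DISABLED ERANGE

/* Stand-in for the notifier setup path: without MSI-X, return the
 * sentinel so callers fall back to userspace virtio quietly. */
static int set_guest_notifier(int msix_enabled)
{
    if (!msix_enabled)
        return -EVIRTIO_DISABLED;   /* disable, but don't complain */
    return 0;                       /* success */
}

/* Callers print an error only for "real" failures. */
static int should_report(int r)
{
    return r < 0 && r != -EVIRTIO_DISABLED;
}
```

The risk MST acknowledges is that ERANGE could in principle leak out of some ioctl on this path, which is why he picked a value "never likely to be used by a userspace ioctl" and invited alternatives.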
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkinm...@redhat.com I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says use vhost where it helps but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. Maybe this is best handled by a documentation update? We always said: use vhost=on to enable experimental in kernel accelerator\n note 'enable' not 'require'. This is similar to how we specify nvectors : you can not make guest use the feature. 
How about this:

diff --git a/qemu-options.hx b/qemu-options.hx
index 898561d..3c937c1 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1061,6 +1061,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
     "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
     "use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
     "use vhost=on to enable experimental in kernel accelerator\n"
+    "    (note: vhost=on has no effect unless guest uses MSI-X)\n"
     "use 'vhostfd=h' to connect to an already opened vhost net device\n"
 #endif
     "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote: On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments? Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED? -Sridhar The error is reported by virtio-pci which does not know about vhost. I started with EVIRTIO_MSIX_DISABLED and made it shorter. Would EVIRTIO_MSIX_DISABLED be better?
hw/vhost.c      |4 +++-
 hw/virtio-net.c |6 ++++--
 hw/virtio-pci.c |3 +++
 hw/virtio.h     |2 ++
 4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1d09ed0..c79765a 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
     r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ccb3e63..5de3fee 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
     if (!n->vhost_started) {
         int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
         if (r < 0) {
-            error_report("unable to start vhost net: %d: "
-                         "falling back on userspace virtio", -r);
+            if (r != -EVIRTIO_DISABLED) {
+                error_report("unable to start vhost net: %d: "
+                             "falling back on userspace virtio", -r);
+            }
         } else {
             n->vhost_started = 1;
         }
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index dd8887a..dbf4be0 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
     EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
     if (assign) {
+        if (!msix_enabled(&proxy->pci_dev)) {
+            return -EVIRTIO_DISABLED;
+        }
         int r = event_notifier_init(notifier, 0);
         if (r < 0) {
             return r;
diff --git a/hw/virtio.h b/hw/virtio.h
index d8546d5..53bbdba 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -98,6 +98,8 @@ typedef struct {
     void (*vmstate_change)(void * opaque, bool running);
 } VirtIOBindings;

+#define EVIRTIO_DISABLED ERANGE
+
 #define VIRTIO_PCI_QUEUE_MAX 64

 #define VIRTIO_NO_VECTOR 0xffff
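The pattern in the patch above, a reserved errno value that means "disabled, fall back silently" rather than "failed", can be sketched in isolation. This is a hedged sketch, not QEMU code: set_guest_notifier() and start_vhost() below are illustrative stand-ins for the real virtio-pci and virtio-net callers.

```c
#include <errno.h>
#include <stdio.h>

/* EVIRTIO_DISABLED aliases an errno value (ERANGE) that this call chain
 * is known never to produce for a real failure, so callers can tell
 * "feature unavailable" apart from "something broke". */
#define EVIRTIO_DISABLED ERANGE

/* Stand-in for virtio_pci_set_guest_notifier(): refuse when MSI-X is off. */
int set_guest_notifier(int msix_enabled)
{
    if (!msix_enabled)
        return -EVIRTIO_DISABLED;
    return 0;
}

/* Stand-in for the vhost_net_start() caller: returns 1 if vhost started,
 * 0 if it silently fell back; only real errors get logged. */
int start_vhost(int msix_enabled)
{
    int r = set_guest_notifier(msix_enabled);
    if (r < 0) {
        if (r != -EVIRTIO_DISABLED)
            fprintf(stderr, "unable to start vhost net: %d\n", -r);
        return 0; /* fall back on userspace virtio */
    }
    return 1;
}
```

The caller checks for the sentinel before printing, which is exactly why Sridhar's naming question matters: the define lives in virtio.h, not vhost.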
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 06:23:36PM -0600, Anthony Liguori wrote: On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkinm...@redhat.com I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says use vhost where it helps but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. In the very least, there needs to be a vhost=force. Having some sort of friendly default policy is fine but we need to provide a mechanism for a user to have the final say. If you want to redefine vhost=on to really mean, use the friendly default, that's fine by me, but only if the vhost=force option exists. OK, I will add that, probably as a separate flag as vhost is a boolean. This will get worse performance but it will be what the user asked for. I actually would think libvirt would want to use vhost=force. 
Debugging with vhost=on is going to be a royal pain in the ass if a user reports bad performance. Given the libvirt XML, you can't actually tell from the guest and the XML whether or not vhost was actually in use or not. Yes you can: check MSI enabled in the guest, if it is - check vhost enabled in the XML. Not that bad at all, is it? Regards, Anthony Liguori We get worse performance without MSI anyway, how is this different? Maybe this is best handled by a documentation update? We always said: use vhost=on to enable experimental in kernel accelerator\n note 'enable' not 'require'. This is similar to how we specify nvectors : you can not make guest use the feature. How about this:

diff --git a/qemu-options.hx b/qemu-options.hx
index 898561d..3c937c1 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1061,6 +1061,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
     "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
     "use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
     "use vhost=on to enable experimental in kernel accelerator\n"
+    "    (note: vhost=on has no effect unless guest uses MSI-X)\n"
     "use 'vhostfd=h' to connect to an already opened vhost net device\n"
 #endif
     "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 06:35:46PM -0700, Alex Williamson wrote: On Thu, 2011-01-20 at 18:23 -0600, Anthony Liguori wrote: On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkinm...@redhat.com I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says use vhost where it helps but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. In the very least, there needs to be a vhost=force. Having some sort of friendly default policy is fine but we need to provide a mechanism for a user to have the final say. If you want to redefine vhost=on to really mean, use the friendly default, that's fine by me, but only if the vhost=force option exists. I actually would think libvirt would want to use vhost=force. Debugging with vhost=on is going to be a royal pain in the ass if a user reports bad performance. 
Given the libvirt XML, you can't actually tell from the guest and the XML whether or not vhost was actually in use or not. If we add a force option, let's please distinguish hotplug from VM creation time. The latter can abort. Hotplug should print an error and fail the initfn. It can't abort at init - MSI is disabled at init, it needs to be enabled by the guest later. And aborting the guest in the middle of the run is a very bad idea. What vhostforce=true will do is force vhost backend to be used even if it is slower. Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
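The policy being negotiated above (plain vhost=on falls back to userspace virtio when the guest has not enabled MSI-X, while vhostforce keeps vhost even if it is slower) reduces to one decision function. A sketch only: use_vhost() and its flag names are hypothetical, not QEMU's API.

```c
/* Decide whether the vhost backend should be used, per the thread's
 * proposed semantics:
 *   vhost=on     -> use vhost only when the guest enabled MSI-X
 *   vhostforce   -> use vhost unconditionally (user takes the perf hit)
 * All three arguments are booleans (0/1). */
int use_vhost(int vhost_on, int vhost_force, int guest_msix_on)
{
    if (!vhost_on && !vhost_force)
        return 0;           /* user did not ask for vhost at all */
    if (guest_msix_on)
        return 1;           /* irqfd fast path available: always use it */
    return vhost_force;     /* without MSI-X, only force keeps vhost on */
}
```

Note the check has to run when the guest toggles MSI-X, not at device init, which is why aborting at init (as discussed below) cannot work.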
Re: Flow Control and Port Mirroring Revisited
On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote: [ Trimmed Eric from CC list as vger was complaining that it is too long ] On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote: So it won't be all that simple to implement well, and before we try, I'd like to know whether there are applications that are helped by it. For example, we could try to measure latency at various pps and see whether the backpressure helps. netperf has -b, -w flags which might help these measurements. Those options are enabled when one adds --enable-burst to the pre-compilation ./configure of netperf (one doesn't have to recompile netserver). However, if one is also looking at latency statistics via the -j option in the top-of-trunk, or simply at the histogram with --enable-histogram on the ./configure and a verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf from: Hi, I have constructed a test where I run an un-paced UDP_STREAM test in one guest and a paced omni rr test in another guest at the same time. Hmm, what is this supposed to measure? Basically each time you run an un-paced UDP_STREAM you get some random load on the network. You can't tell what it was exactly, only that it was between the send and receive throughput. Briefly I get the following results from the omni test:

1. Omni test only:       MEAN_LATENCY=272.00
2. Omni and stream test: MEAN_LATENCY=3423.00
3. cpu and net_cls group: MEAN_LATENCY=493.00
   As per 2 plus cgroups are created for each guest and guest tasks added to the groups
4. 100Mbit/s class:      MEAN_LATENCY=273.00
   As per 3 plus the net_cls groups each have a 100Mbit/s HTB class
5. cpu.shares=128:       MEAN_LATENCY=652.00
   As per 4 plus the cpu groups have cpu.shares set to 128
6. Busy CPUs:            MEAN_LATENCY=15126.00
   As per 5 but the CPUs are made busy using a simple shell while loop

There is a bit of noise in the results as the two netperf invocations aren't started at exactly the same moment. For reference, my netperf invocations are:

netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr -k foo -b 1 -w 200 -m 200

foo contains:
PROTOCOL
THROUGHPUT,THROUGHPUT_UNITS
LOCAL_SEND_THROUGHPUT
LOCAL_RECV_THROUGHPUT
REMOTE_SEND_THROUGHPUT
REMOTE_RECV_THROUGHPUT
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
LOCAL_CPU_UTIL,REMOTE_CPU_UTIL
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Fri, Jan 21, 2011 at 06:19:13AM -0700, Alex Williamson wrote: On Fri, 2011-01-21 at 11:55 +0200, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 06:35:46PM -0700, Alex Williamson wrote: On Thu, 2011-01-20 at 18:23 -0600, Anthony Liguori wrote: On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkinm...@redhat.com I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says use vhost where it helps but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. In the very least, there needs to be a vhost=force. Having some sort of friendly default policy is fine but we need to provide a mechanism for a user to have the final say. If you want to redefine vhost=on to really mean, use the friendly default, that's fine by me, but only if the vhost=force option exists. 
I actually would think libvirt would want to use vhost=force. Debugging with vhost=on is going to be a royal pain in the ass if a user reports bad performance. Given the libvirt XML, you can't actually tell from the guest and the XML whether or not vhost was actually in use or not. If we add a force option, let's please distinguish hotplug from VM creation time. The latter can abort. Hotplug should print an error and fail the initfn. It can't abort at init - MSI is disabled at init, it needs to be enabled by the guest later. And aborting the guest in the middle of the run is a very bad idea. Yeah, I was thinking about the ordering of device being added vs guest enabling MSI this morning. Waiting until the guest decides to try to start using the device to NAK it with an abort is very undesirable. What if when we have vhost=on,force, the device doesn't advertise an INTx (PCI_INTERRUPT_PIN = 0)? Alex Then we break backward compatibility with old guests. I don't see what the issue is really: It is trivial to check that the guest uses MSIX. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Flow Control and Port Mirroring Revisited
On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote: On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote: [ Trimmed Eric from CC list as vger was complaining that it is too long ] On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote: So it won't be all that simple to implement well, and before we try, I'd like to know whether there are applications that are helped by it. For example, we could try to measure latency at various pps and see whether the backpressure helps. netperf has -b, -w flags which might help these measurements. Those options are enabled when one adds --enable-burst to the pre-compilation ./configure of netperf (one doesn't have to recompile netserver). However, if one is also looking at latency statistics via the -j option in the top-of-trunk, or simply at the histogram with --enable-histogram on the ./configure and a verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf from: Hi, I have constructed a test where I run an un-paced UDP_STREAM test in one guest and a paced omni rr test in another guest at the same time. Hmm, what is this supposed to measure? Basically each time you run an un-paced UDP_STREAM you get some random load on the network. You can't tell what it was exactly, only that it was between the send and receive throughput. Rick mentioned in another email that I messed up my test parameters a bit, so I will re-run the tests, incorporating his suggestions. What I was attempting to measure was the effect of an unpaced UDP_STREAM on the latency of more moderated traffic. Because I am interested in what effect an abusive guest has on other guests and how that may be mitigated. Could you suggest some tests that you feel are more appropriate? Yes.
To rephrase my concern in these terms, besides the malicious guest you have another software in host (netperf) that interferes with the traffic, and it cooperates with the malicious guest. Right? IMO for a malicious guest you would send UDP packets that then get dropped by the host. For example block netperf in host so that it does not consume packets from the socket.
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Fri, Jan 21, 2011 at 04:48:02PM -0700, Alex Williamson wrote: When doing device assignment, we use cpu_register_physical_memory() to directly map the qemu mmap of the device resource into the address space of the guest. The unadvertised feature of the register physical memory code path on kvm, at least for this type of mapping, is that it needs to allocate an index from a small, fixed array of memory slots. Even better, if it can't get an index, the code aborts deep in the kvm specific bits, preventing the caller from having a chance to recover. It's really easy to hit this by hot adding too many assigned devices to a guest (pretty easy to hit with too many devices at instantiation time too, but the abort is slightly more bearable there). I'm assuming it's pretty difficult to make the memory slot array dynamically sized. If that's not the case, please let me know as that would be a much better solution. I'm not terribly happy with the solution in this series, it doesn't provide any guarantees whether a cpu_register_physical_memory() will succeed, only slightly better educated guesses. Are there better ideas how we could solve this? Thanks, Alex Put the table in qemu memory, make kvm access it with copy from/to user? It can then be any size ... --- Alex Williamson (2): device-assignment: Count required kvm memory slots kvm: Allow querying free slots hw/device-assignment.c | 59 +++- hw/device-assignment.h |3 ++ kvm-all.c | 16 + kvm.h |2 ++ 4 files changed, 79 insertions(+), 1 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Flow Control and Port Mirroring Revisited
On Sun, Jan 23, 2011 at 05:38:49PM +1100, Simon Horman wrote: On Sat, Jan 22, 2011 at 11:57:42PM +0200, Michael S. Tsirkin wrote: On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote: On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote: [ Trimmed Eric from CC list as vger was complaining that it is too long ] On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote: So it won't be all that simple to implement well, and before we try, I'd like to know whether there are applications that are helped by it. For example, we could try to measure latency at various pps and see whether the backpressure helps. netperf has -b, -w flags which might help these measurements. Those options are enabled when one adds --enable-burst to the pre-compilation ./configure of netperf (one doesn't have to recompile netserver). However, if one is also looking at latency statistics via the -j option in the top-of-trunk, or simply at the histogram with --enable-histogram on the ./configure and a verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf from: Hi, I have constructed a test where I run an un-paced UDP_STREAM test in one guest and a paced omni rr test in another guest at the same time. Hmm, what is this supposed to measure? Basically each time you run an un-paced UDP_STREAM you get some random load on the network. You can't tell what it was exactly, only that it was between the send and receive throughput. Rick mentioned in another email that I messed up my test parameters a bit, so I will re-run the tests, incorporating his suggestions. What I was attempting to measure was the effect of an unpaced UDP_STREAM on the latency of more moderated traffic. Because I am interested in what effect an abusive guest has on other guests and how that may be mitigated. Could you suggest some tests that you feel are more appropriate? Yes.
To rephrase my concern in these terms, besides the malicious guest you have another software in host (netperf) that interferes with the traffic, and it cooperates with the malicious guest. Right? Yes, that is the scenario in this test. Yes but I think that you want to put some controlled load on host. Let's assume that we improve the speed somehow and now you can push more bytes per second without loss. Result might be a regression in your test because you let the guest push as much as it can and suddenly it can push more data through. OTOH with packet loss the load on host is anywhere in between send and receive throughput: there's no easy way to measure it from netperf: the earlier some buffers overrun, the earlier the packets get dropped and the less the load on host. This is why I say that to get a specific load on host you want to limit the sender to a specific BW and then either - make sure packet loss % is close to 0. - make sure packet loss % is close to 100%. IMO for a malicious guest you would send UDP packets that then get dropped by the host. For example block netperf in host so that it does not consume packets from the socket. I'm more interested in rate-limiting netperf than blocking it. Well I mean netperf on host. But in any case, do you mean use iptables or tc based on classification made by net_cls? Just to block netperf you can send it SIGSTOP :) -- MST
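The "limit the sender to a specific BW" idea above is a classic token bucket. A minimal sketch, illustrative only: a real setup would use tc/HTB classes as in the cgroup tests earlier in the thread, and the names here are not from any existing tool.

```c
/* Token bucket: tokens refill at `rate` bytes/sec up to `burst` bytes;
 * a packet is admitted only if enough tokens are available.
 * Timestamps are in microseconds to keep the arithmetic simple. */
typedef struct {
    double rate;        /* refill rate, bytes per second */
    double burst;       /* bucket depth, bytes */
    double tokens;      /* current fill, bytes */
    long long last;     /* time of last update, usec */
} bucket_t;

/* Returns 1 if `bytes` may be sent at time `now_usec`, 0 to drop/delay. */
int bucket_allow(bucket_t *b, long long now_usec, int bytes)
{
    b->tokens += (now_usec - b->last) * b->rate / 1e6;
    if (b->tokens > b->burst)
        b->tokens = b->burst;
    b->last = now_usec;
    if (b->tokens < bytes)
        return 0;
    b->tokens -= bytes;
    return 1;
}
```

With the bucket kept near 0% or 100% loss, as MST suggests, the admitted rate (not the guest's attempted rate) defines the host load.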
Re: Flow Control and Port Mirroring Revisited
On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote: Just to block netperf you can send it SIGSTOP :) Clever :) One could I suppose achieve the same result by making the remote receive socket buffer size smaller than the UDP message size and then not worry about having to learn the netserver's PID to send it the SIGSTOP. I *think* the semantics will be substantially the same? If you could set it, yes. But at least linux ignores any value substantially smaller than 1K, and then multiplies that by 2:

case SO_RCVBUF:
    /* Don't error on this BSD doesn't and if you think
       about it this is right. Otherwise apps have to
       play 'guess the biggest size' games. RCVBUF/SNDBUF
       are treated in BSD as hints */
    if (val > sysctl_rmem_max)
        val = sysctl_rmem_max;
set_rcvbuf:
    sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
    /*
     * We double it on the way in to account for
     * struct sk_buff etc. overhead. Applications
     * assume that the SO_RCVBUF setting they make will
     * allow that much actual data to be received on that
     * socket.
     *
     * Applications are unaware that struct sk_buff and
     * other overheads allocate from the receive buffer
     * during socket buffer allocation.
     *
     * And after considering the possible alternatives,
     * returning the value we actually used in getsockopt
     * is the most desirable behavior.
     */
    if ((val * 2) < SOCK_MIN_RCVBUF)
        sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
    else
        sk->sk_rcvbuf = val * 2;

and

/*
 * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
 * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
 */
#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))

Both will be drops at the socket buffer, albeit for different reasons. The too small socket buffer version though doesn't require one remember to wake the netserver in time to have it send results back to netperf without netperf tossing-up an error and not reporting any statistics.
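The doubling-and-clamping behavior in the sock.c excerpt above is easy to observe from userspace. A Linux-specific sketch (effective_rcvbuf() is an illustrative helper, not a library call):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Ask the kernel for a given SO_RCVBUF and report what it actually
 * installed: on Linux the requested value is doubled on the way in and
 * clamped below by SOCK_MIN_RCVBUF, exactly as in the code quoted above. */
int effective_rcvbuf(int requested)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    int val = requested;
    socklen_t len = sizeof(val);

    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val));
    getsockopt(s, SOL_SOCKET, SO_RCVBUF, &val, &len);
    close(s);
    return val;
}
```

This is why Rick's `-S 1` run below still reports a 256-byte (pre-doubling view) receive buffer instead of 1: the kernel refused to go that low.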
Also, netperf has a no control connection mode where you can, in effect cause it to send UDP datagrams out into the void - I put it there to allow folks to test against the likes of echo discard and chargen services but it may have a use here. Requires that one specify the destination IP and port for the data connection explicitly via the test-specific options. In that mode the only stats reported are those local to netperf rather than netserver. Ah, sounds perfect. happy benchmarking, rick jones -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Flow Control and Port Mirroring Revisited
On Mon, Jan 24, 2011 at 11:01:45AM -0800, Rick Jones wrote: Michael S. Tsirkin wrote: On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote: Just to block netperf you can send it SIGSTOP :) Clever :) One could I suppose achieve the same result by making the remote receive socket buffer size smaller than the UDP message size and then not worry about having to learn the netserver's PID to send it the SIGSTOP. I *think* the semantics will be substantially the same? If you could set it, yes. But at least linux ignores any value substantially smaller than 1K, and then multiplies that by 2:

case SO_RCVBUF:
    /* Don't error on this BSD doesn't and if you think
       about it this is right. Otherwise apps have to
       play 'guess the biggest size' games. RCVBUF/SNDBUF
       are treated in BSD as hints */
    if (val > sysctl_rmem_max)
        val = sysctl_rmem_max;
set_rcvbuf:
    sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
    /*
     * We double it on the way in to account for
     * struct sk_buff etc. overhead. Applications
     * assume that the SO_RCVBUF setting they make will
     * allow that much actual data to be received on that
     * socket.
     *
     * Applications are unaware that struct sk_buff and
     * other overheads allocate from the receive buffer
     * during socket buffer allocation.
     *
     * And after considering the possible alternatives,
     * returning the value we actually used in getsockopt
     * is the most desirable behavior.
     */
    if ((val * 2) < SOCK_MIN_RCVBUF)
        sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
    else
        sk->sk_rcvbuf = val * 2;

and

/*
 * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
 * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
 */
#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))

Pity - seems to work back on 2.6.26: Hmm, that code is there at least as far back as 2.6.12.
raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     2882334      0     2361.17
   256           10.00           0               0.00

raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

Still, even with that (or SIGSTOP) we don't really know where the packets were dropped right? There is no guarantee they weren't dropped before they got to the socket buffer

happy benchmarking, rick jones

Right. Better send to a port with no socket listening there, that would drop the packet at an early (if not at the earliest possible) opportunity.

PS - here is with a -S 1024 option:

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1024 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     1679269      0     1375.64
  2048           10.00     1490662             1221.13

showing that there is a decent chance that many of the frames were dropped at the socket buffer, but not all - I suppose I could/should be checking netstat stats...
:) And just a little more, only because I was curious :)

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1M -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928     257   10.00     1869134      0      384.29
262142           10.00     1869134             384.29

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928     257   10.00     3076363      0      632.49
   256           10.00           0                0.00
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 07:41:32AM -0700, Alex Williamson wrote: On Tue, 2011-01-25 at 08:36 +0100, Jan Kiszka wrote: On 2011-01-25 06:37, Alex Williamson wrote: On Mon, 2011-01-24 at 08:44 -0700, Alex Williamson wrote: I'll look at how we might be able to allocate slots on demand. Thanks, Here's a first cut just to see if this looks agreeable. This allows the slot array to grow on demand. This works with current userspace, as well as userspace trivially modified to double KVMState.slots and hotplugging enough pci-assign devices to exceed the previous limit (w/ w/o ept). Hopefully I got the rcu bits correct. Does this look like the right path? If so, I'll work on removing the fixed limit from userspace next. Thanks, Alex kvm: Allow memory slot array to grow on demand Remove fixed KVM_MEMORY_SLOTS limit, allowing the slot array to grow on demand. Private slots are now allocated at the front instead of the end. Only x86 seems to use private slots, Hmm, doesn't current user space expect slots 8..11 to be the private ones and wouldn't it cause troubles if slots 0..3 are suddenly reserved? The private slots aren't currently visible to userspace, they're actually slots 32..35. The patch automatically increments user passed slot ids so userspace has its own zero-based view of the array. Frankly, I don't understand why userspace reserves slots 8..11, is this compatibility with older kernel implementations? so this is now zero for all other archs. The memslots pointer is already updated using rcu, so changing the size of the array when it's replaced is straightforward. x86 also keeps a bitmap of slots used by a kvm_mmu_page, which requires a shadow tlb flush whenever we increase the number of slots. This forces the pages to be rebuilt with the new bitmap size. Is it possible for user space to increase the slot number to ridiculous amounts (at least as far as kmalloc allows) and then trigger a kernel walk through them in non-preemptible contexts?
Just wondering, I haven't checked all contexts of functions like kvm_is_visible_gfn yet. If yes, we should already switch to rbtree or something like that. Otherwise that may wait a bit, but probably not too long. Yeah, Avi has brought up the hole that userspace can exploit this interface with these changes. However, for 99+% of users, this change leaves the slot array at about the same size, or makes it smaller. Only huge, scale-out guests would probably even see a doubling of slots (my guest with 14 82576 VFs uses 48 slots). On the kernel side, I think we can safely save a tree implementation as a later optimization should we determine it's necessary. We'll have to see how the userspace side matches to figure out what's best there. Thanks, Alex Also, is there any time where we need to scan all slots on data path? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
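The grow-on-demand scheme Alex describes can be sketched in userspace C. This is only an illustrative model (the names `memslots` and `grow_memslots` are made up, and the kernel's RCU publication of the new array is reduced to returning a fresh copy):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct memslot { unsigned long base_gfn, npages; };

struct memslots {
    int nslots;
    struct memslot slots[];   /* flexible array member, sized on demand */
};

/* Return a copy of 'old' with room for one more slot.  The kernel patch
 * would publish the new array with rcu_assign_pointer() and free the old
 * one after a grace period; this model just returns the copy. */
static struct memslots *grow_memslots(const struct memslots *old)
{
    int n = old ? old->nslots : 0;
    struct memslots *fresh =
        malloc(sizeof(*fresh) + (n + 1) * sizeof(struct memslot));

    if (!fresh)
        return NULL;
    fresh->nslots = n + 1;
    if (old)
        memcpy(fresh->slots, old->slots, n * sizeof(struct memslot));
    memset(&fresh->slots[n], 0, sizeof(struct memslot));
    return fresh;
}
```

In the kernel the old array may only be freed after readers are done with it, which is what the rcu bits Alex mentions take care of.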
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 12:20:03PM +0200, Avi Kivity wrote: On 01/24/2011 11:32 AM, Marcelo Tosatti wrote: On Fri, Jan 21, 2011 at 04:48:02PM -0700, Alex Williamson wrote: When doing device assignment, we use cpu_register_physical_memory() to directly map the qemu mmap of the device resource into the address space of the guest. The unadvertised feature of the register physical memory code path on kvm, at least for this type of mapping, is that it needs to allocate an index from a small, fixed array of memory slots. Even better, if it can't get an index, the code aborts deep in the kvm specific bits, preventing the caller from having a chance to recover. It's really easy to hit this by hot adding too many assigned devices to a guest (pretty easy to hit with too many devices at instantiation time too, but the abort is slightly more bearable there). I'm assuming it's pretty difficult to make the memory slot array dynamically sized. If that's not the case, please let me know as that would be a much better solution. It's not difficult to either increase the maximum number (defined as 32 now in both qemu and kernel) of static slots, or support dynamic increases, if it turns out to be a performance issue. We can't make it unbounded in the kernel, since a malicious user could start creating an infinite amount of memory slots, pinning unbounded kernel memory. How about keeping the slots in userspace memory, access them with copy from user? If we make the limit much larger, we should start to think about efficiency. Every mmio vmexit is currently a linear scan of the memory slot table, which is efficient at a small number of slots, but not at a large number. We could conceivably encode the no slot information into a bit in the not-present spte. OK, but the slots that Alex here wants to use are presumably mostly not resulting in a pagefault at all, right? Can we split such guys out so they don't slow mmio lookup? Maybe keep *these* in userspace memory.
-- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote: For the other lookups, which we believe will succeed, we can assume the probability of a match is related to the slot size, and sort the slots by page count. Unlikely to be true for assigned device BARs. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 04:58:41PM +0200, Avi Kivity wrote: On 01/25/2011 04:55 PM, Michael S. Tsirkin wrote: We can't make it unbounded in the kernel, since a malicious user could start creating an infinite amount of memory slots, pinning unbounded kernel memory. How about keeping the slots in userspace memory, access them with copy from user? Some of the data is validated by the kernel, so it needs a kernel copy. Other fields are completely internal to the kernel. If we make the limit much larger, we should start to think about efficiency. Every mmio vmexit is currently a linear scan of the memory slot table, which is efficient at a small number of slots, but not at a large number. We could conceivably encode the no slot information into a bit in the not-present spte. OK, but the slots that Alex here wants to use are presumably mostly not resulting in a pagefault at all, right? Can we split such guys out so they don't slow mmio lookup? Maybe keep *these* in userspace memory. The algorithm is: page fault: for each slot: if addr in slot: populate spte return # no slot matches -> mmio. So we have to try out all slots before we find out none of them are needed. I see now. Yes, a flag in spte would help. Changes in slots would then have to update all these flags. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
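The lookup loop Avi spells out ("for each slot: if addr in slot ...") looks roughly like this in C; `find_slot` and the flat array are illustrative, not the kernel's actual kvm_memslots structure:

```c
#include <assert.h>
#include <stddef.h>

struct memslot { unsigned long base_gfn, npages; };

/* The fault path from the thread: walk every slot, and recognize an mmio
 * address only after no slot matched. */
static const struct memslot *find_slot(const struct memslot *slots, int n,
                                       unsigned long gfn)
{
    for (int i = 0; i < n; i++)
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return &slots[i];
    return NULL;   /* no match: treat the access as emulated mmio */
}
```

The point of the discussion is the NULL case: mmio is only discovered after scanning the whole array, which is why a large slot count hurts.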
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote: On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote: For the other lookups, which we believe will succeed, we can assume the probability of a match is related to the slot size, and sort the slots by page count. Unlikely to be true for assigned device BARs. We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and lots of small slots for BARs and such. The vast majority of faults will either touch one of the two largest slots, or will miss all slots. Relatively few will hit the small slots. Not if you are using one of the assigned devices and don't do any faults on memory :) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Tue, Jan 25, 2011 at 07:34:18PM +0200, Avi Kivity wrote: On 01/25/2011 05:23 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 04:58:41PM +0200, Avi Kivity wrote: On 01/25/2011 04:55 PM, Michael S. Tsirkin wrote: We can't make it unbounded in the kernel, since a malicious user could start creating an infinite amount of memory slots, pinning unbounded kernel memory. How about keeping the slots in userspace memory, access them with copy from user? Some of the data is validated by the kernel, so it needs a kernel copy. Other fields are completely internal to the kernel. If we make the limit much larger, we should start to think about efficiency. Every mmio vmexit is currently a linear scan of the memory slot table, which is efficient at a small number of slots, but not at a large number. We could conceivably encode the no slot information into a bit in the not-present spte. OK, but the slots that Alex here wants to use are presumably mostly not resulting in a pagefault at all, right? Can we split such guys out so they don't slow mmio lookup? Maybe keep *these* in userspace memory. The algorithm is: page fault: for each slot: if addr in slot: populate spte return # no slot matches -> mmio. So we have to try out all slots before we find out none of them are needed. I see now. Yes, a flag in spte would help. Changes in slots would then have to update all these flags. That's easy, we drop all sptes. Ah, right. Hmm, doesn't the cpu already have some flag to distinguish mmio sptes? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Wed, Jan 26, 2011 at 11:17:11AM +0200, Avi Kivity wrote: On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote: On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote: For the other lookups, which we believe will succeed, we can assume the probability of a match is related to the slot size, and sort the slots by page count. Unlikely to be true for assigned device BARs. We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and lots of small slots for BARs and such. The vast majority of faults will either touch one of the two largest slots, or will miss all slots. Relatively few will hit the small slots. Not if you are using one of the assigned devices and don't do any faults on memory :) It's impossible not to fault on memory. No, I mean the RAM. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Wed, Jan 26, 2011 at 11:23:21AM +0200, Avi Kivity wrote: On 01/26/2011 11:20 AM, Michael S. Tsirkin wrote: On Wed, Jan 26, 2011 at 11:17:11AM +0200, Avi Kivity wrote: On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote: On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote: For the other lookups, which we believe will succeed, we can assume the probability of a match is related to the slot size, and sort the slots by page count. Unlikely to be true for assigned device BARs. We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and lots of small slots for BARs and such. The vast majority of faults will either touch one of the two largest slots, or will miss all slots. Relatively few will hit the small slots. Not if you are using one of the assigned devices and don't do any faults on memory :) It's impossible not to fault on memory. No, I mean the RAM. No idea what you mean. It's impossible not to fault on RAM, either (unless you don't use it at all). I just mean that once you fault you map sptes and then you can use them without exits. mmio will cause exits each time. Right? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Wed, Jan 26, 2011 at 11:54:21AM +0200, Avi Kivity wrote: On 01/26/2011 11:39 AM, Michael S. Tsirkin wrote: On Wed, Jan 26, 2011 at 11:23:21AM +0200, Avi Kivity wrote: On 01/26/2011 11:20 AM, Michael S. Tsirkin wrote: On Wed, Jan 26, 2011 at 11:17:11AM +0200, Avi Kivity wrote: On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote: On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote: On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote: For the other lookups, which we believe will succeed, we can assume the probability of a match is related to the slot size, and sort the slots by page count. Unlikely to be true for assigned device BARs. We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and lots of small slots for BARs and such. The vast majority of faults will either touch one of the two largest slots, or will miss all slots. Relatively few will hit the small slots. Not if you are using one of the assigned devices and don't do any faults on memory :) It's impossible not to fault on memory. No, I mean the RAM. No idea what you mean. It's impossible not to fault on RAM, either (unless you don't use it at all). I just mean that once you fault you map sptes and then you can use them without exits. mmio will cause exits each time. Right? The swapper scanning sptes, ksmd, khugepaged, and swapping can all cause a page to be unmapped. Though it should certainly happen with a much lower frequency than mmio. Right. That's why I say that sorting by size might not be optimal. Maybe a cache ... -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
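Avi's sort-by-page-count proposal is just a descending sort over the slot array; this userspace sketch (`sort_slots` is a made-up name) shows the idea:

```c
#include <assert.h>
#include <stdlib.h>

struct memslot { unsigned long base_gfn, npages; };

/* Descending comparison by npages: biggest slots first. */
static int cmp_npages_desc(const void *a, const void *b)
{
    const struct memslot *x = a, *y = b;

    if (x->npages != y->npages)
        return x->npages < y->npages ? 1 : -1;
    return 0;
}

/* Put the large RAM slots at the front of the scan order; the many
 * small BAR slots end up last. */
static void sort_slots(struct memslot *slots, int n)
{
    qsort(slots, n, sizeof(*slots), cmp_npages_desc);
}
```

This is exactly the bet under debate: it wins when most faults land in the big RAM slots, and loses when the hot lookups (assigned-device BARs, mmio misses) live at the tail.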
Re: Network performance with small packets
On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote: I am working on a KVM network performance issue found in our lab running the DayTrader benchmark. The benchmark throughput takes a significant hit when running the application server in a KVM guest versus on bare metal. We have dug into the problem and found that DayTrader's use of small packets exposes KVM's overhead of handling network packets. I have been able to reproduce the performance hit with a simpler setup using the netperf benchmark with the TCP_RR test and the request and response sizes set to 256 bytes. I run the benchmark between two physical systems, each using a 1GB link. In order to get the maximum throughput for the system I have to run 100 instances of netperf. When I run the netserver processes in a guest, I see a maximum throughput that is 51% of what I get if I run the netserver processes directly on the host. The CPU utilization in the guest is only 85% at maximum throughput, whereas it is 100% on bare metal. The KVM host has 16 CPUs. The KVM guest is configured with 2 VCPUs. When I run netperf on the host I boot the host with maxcpus=2 on the kernel command line. The host is running the current KVM upstream kernel along with the current upstream qemu. Here is the qemu command used to launch the guest: /build/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 -name glasgow-RH60 -m 32768 -drive file=/build/guest-data/glasgow-RH60.img,if=virtio,index=0,boot=on -drive file=/dev/virt/WAS,if=virtio,index=1 -net nic,model=virtio,vlan=3,macaddr=00:1A:64:E5:00:63,netdev=nic0 -netdev tap,id=nic0,vhost=on -smp 2 -vnc :1 -monitor telnet::4499,server,nowait -serial telnet::8899,server,nowait --mem-path /libhugetlbfs -daemonize We have tried various proposed fixes, each with varying amounts of success. One such fix was to add code to the vhost thread such that when it found the work queue empty it wouldn't just exit the thread but rather would delay for 50 microseconds and then recheck the queue.
If there was work on the queue it would loop back and process it, else it would exit the thread. The change got us a 13% improvement in the DayTrader throughput. Running the same netperf configuration on the same hardware but using a different hypervisor gets us significantly better throughput numbers. The guest on that hypervisor runs at 100% CPU utilization. The various fixes we have tried have not gotten us close to the throughput seen on the other hypervisor. I'm looking for ideas/input from the KVM experts on how to make KVM perform better when handling small packets. Thanks, Steve I am seeing a similar problem, and am trying to fix that. My current theory is that this is a variant of a receive livelock: if the application isn't fast enough to process incoming data, the guest net stack switches from prequeue to backlog handling. One thing I noticed is that locking the vhost thread and the vcpu to the same physical CPU almost doubles the bandwidth. Can you confirm that in your setup? My current guess is that when we lock both to a single CPU, netperf in the guest gets scheduled, slowing down the vhost thread in the host. I also noticed that this specific workload performs better with vhost off: presumably we are loading the guest less. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
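The pinning experiment MST mentions (vhost thread and vcpu locked to the same physical CPU) is usually done with taskset, but the underlying call can be sketched directly; this is a minimal Linux-only illustration, not code from either hypervisor:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single CPU.  Returns 0 on success.
 * Applied to both the vhost worker and the vcpu thread with the same
 * 'cpu', this reproduces the experiment described above. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

The equivalent from the shell would be something like `taskset -pc 0 <pid>` for each of the two threads.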
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Thu, Jan 27, 2011 at 11:21:47AM +0200, Avi Kivity wrote: On 01/26/2011 02:08 PM, Michael S. Tsirkin wrote: I just mean that once you fault you map sptes and then you can use them without exits. mmio will cause exits each time. Right? The swapper scanning sptes, ksmd, khugepaged, and swapping can all cause a page to be unmapped. Though it should certainly happen with a much lower frequency than mmio. Right. That's why I say that sorting by size might not be optimal. Maybe a cache ... Why would it not be optimal? If you have 16GB RAM in two slots and a few megabytes here and there scattered in some slots, you have three orders of magnitude to spare. Yes, but you might not be accessing all that RAM. Maybe your working set is just tens of megabytes. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Thu, Jan 27, 2011 at 11:26:19AM +0200, Michael S. Tsirkin wrote: On Thu, Jan 27, 2011 at 11:21:47AM +0200, Avi Kivity wrote: On 01/26/2011 02:08 PM, Michael S. Tsirkin wrote: I just mean that once you fault you map sptes and then you can use them without exits. mmio will cause exits each time. Right? The swapper scanning sptes, ksmd, khugepaged, and swapping can all cause a page to be unmapped. Though it should certainly happen with a much lower frequency than mmio. Right. That's why I say that sorting by size might not be optimal. Maybe a cache ... Why would it not be optimal? If you have 16GB RAM in two slots and a few megabytes here and there scattered in some slots, you have three orders of magnitude to spare. Yes, but you might not be accessing all that RAM. Maybe your working set is just tens of megabytes. Anyway, what I am saying is that this is all just guesswork. Some kind of cache sounds, to me, like a better approach if we have a ton of slots. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
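The "some kind of cache" idea can be as simple as remembering the last slot that matched and probing it before the linear scan; a userspace sketch (names are made up, and a real implementation would need invalidation when slots change):

```c
#include <assert.h>
#include <stddef.h>

struct memslot { unsigned long base_gfn, npages; };

static const struct memslot *cached;   /* last slot that matched */

/* Probe the one-entry cache before falling back to the linear scan. */
static const struct memslot *lookup(const struct memslot *slots, int n,
                                    unsigned long gfn)
{
    if (cached && gfn >= cached->base_gfn &&
        gfn < cached->base_gfn + cached->npages)
        return cached;
    for (int i = 0; i < n; i++)
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return cached = &slots[i];
    return NULL;   /* miss on every slot: emulated mmio */
}
```

Unlike sorting by size, this adapts to whatever slot is actually hot, at the cost of one extra compare on the miss path.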
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Thu, Jan 27, 2011 at 11:28:12AM +0200, Avi Kivity wrote: On 01/27/2011 11:26 AM, Michael S. Tsirkin wrote: Right. That's why I say that sorting by size might not be optimal. Maybe a cache ... Why would it not be optimal? If you have 16GB RAM in two slots and a few megabytes here and there scattered in some slots, you have three orders of magnitude to spare. Yes, but you might not be accessing all that RAM. Maybe your working set is just tens of megabytes. Great, then you fault it in and never touch the slots array again. Yep. Except for emulated mmio :) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio_blk: allow re-reading config space at runtime
On Fri, Jan 14, 2011 at 05:01:37PM +0100, Christoph Hellwig wrote: Wire up the virtio_driver config_changed method to get notified about config changes raised by the host. For now we just re-read the device size to support online resizing of devices, but once we add more attributes that might be changeable they could be added as well. Note that the config_changed method is called from irq context, so we'll have to use the workqueue infrastructure to provide us a proper user context for our changes. Signed-off-by: Christoph Hellwig h...@lst.de

Index: xfs/drivers/block/virtio_blk.c
===
--- xfs.orig/drivers/block/virtio_blk.c	2011-01-13 18:17:23.730254665 +0100
+++ xfs/drivers/block/virtio_blk.c	2011-01-14 16:57:50.572032906 +0100
@@ -6,10 +6,12 @@
 #include <linux/virtio.h>
 #include <linux/virtio_blk.h>
 #include <linux/scatterlist.h>
+#include <linux/string_helpers.h>

 #define PART_BITS 4

 static int major, index;
+struct workqueue_struct *virtblk_wq;

 struct virtio_blk
 {
@@ -42,6 +44,11 @@ struct virtblk_req
 	u8 status;
 };

+struct virtblk_config_change {
+	struct virtio_device *vdev;
+	struct work_struct work;
+};
+
 static void blk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
@@ -291,6 +298,57 @@ static ssize_t virtblk_serial_show(struc
 }
 DEVICE_ATTR(serial, S_IRUGO, virtblk_serial_show, NULL);

+static void virtblk_config_changed_work(struct work_struct *work)
+{
+	struct virtblk_config_change *cfg =
+		container_of(work, struct virtblk_config_change, work);
+	struct virtio_device *vdev = cfg->vdev;
+	struct virtio_blk *vblk = vdev->priv;
+	struct request_queue *q = vblk->disk->queue;
+	char cap_str_2[10], cap_str_10[10];
+	u64 capacity, size;
+
+	/* Host must always specify the capacity. */
+	vdev->config->get(vdev, offsetof(struct virtio_blk_config, capacity),
+			  &capacity, sizeof(capacity));
+
+	/* If capacity is too big, truncate with warning. */
+	if ((sector_t)capacity != capacity) {
+		dev_warn(&vdev->dev, "Capacity %llu too large: truncating\n",
+			 (unsigned long long)capacity);
+		capacity = (sector_t)-1;
+	}
+
+	size = capacity * queue_logical_block_size(q);
+	string_get_size(size, STRING_UNITS_2, cap_str_2, sizeof(cap_str_2));
+	string_get_size(size, STRING_UNITS_10, cap_str_10, sizeof(cap_str_10));
+
+	dev_notice(&vdev->dev,
+		   "new size: %llu %d-byte logical blocks (%s/%s)\n",
+		   (unsigned long long)capacity,
+		   queue_logical_block_size(q),
+		   cap_str_10, cap_str_2);
+
+	set_capacity(vblk->disk, capacity);
+
+}
+
+static void virtblk_config_changed(struct virtio_device *vdev)
+{
+	struct virtblk_config_change *cfg;
+
+	cfg = kmalloc(sizeof(*cfg), GFP_ATOMIC);
+	if (!cfg) {
+		dev_info(&vdev->dev, "skipping config change\n");
+		return;
+	}
+
+	cfg->vdev = vdev;
+	INIT_WORK(&cfg->work, virtblk_config_changed_work);
+	queue_work(virtblk_wq, &cfg->work);

This needs to be flushed on device removal, I think, otherwise the vdev pointer will go stale.

+}
+
+

Two empty lines :)

 static int __devinit virtblk_probe(struct virtio_device *vdev)
 {
 	struct virtio_blk *vblk;
@@ -508,27 +566,47 @@ static unsigned int features[] = {
  * Use __refdata to avoid this warning. */
 static struct virtio_driver __refdata virtio_blk = {
-	.feature_table = features,
-	.feature_table_size = ARRAY_SIZE(features),
-	.driver.name = KBUILD_MODNAME,
-	.driver.owner = THIS_MODULE,
-	.id_table = id_table,
-	.probe = virtblk_probe,
-	.remove = __devexit_p(virtblk_remove),
+	.feature_table		= features,
+	.feature_table_size	= ARRAY_SIZE(features),
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.probe			= virtblk_probe,
+	.remove			= __devexit_p(virtblk_remove),
+	.config_changed		= virtblk_config_changed,
 };

 static int __init init(void)
 {
+	int error;
+
+	virtblk_wq = alloc_workqueue("md_misc", 0, 0);
+	if (!virtblk_wq)
+		return -ENOMEM;
+
 	major = register_blkdev(0, "virtblk");
-	if (major < 0)
-		return major;
-	return register_virtio_driver(&virtio_blk);
+	if (major < 0) {
+		error = major;
+		goto out_destroy_workqueue;
+	}
+
+	error = register_virtio_driver(&virtio_blk);
+	if (error)
+		goto out_unregister_blkdev;
+	return 0;
+
+out_unregister_blkdev:
+	unregister_blkdev(major, "virtblk");
+out_destroy_workqueue:
+
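The capacity truncation in virtblk_config_changed_work() is easy to model on its own: if the 64-bit capacity the host reports does not survive the cast to sector_t, it is clamped to the maximum representable value. This standalone sketch uses a 32-bit stand-in for sector_t (the real type is 32 or 64 bits depending on kernel config):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a kernel config where sector_t is 32 bits wide. */
typedef uint32_t sector32_t;

/* If the 64-bit capacity does not round-trip through sector_t, clamp it
 * to the maximum, as the driver does before logging its warning. */
static uint64_t clamp_capacity(uint64_t capacity)
{
    if ((sector32_t)capacity != capacity)
        return (sector32_t)-1;
    return capacity;
}
```

The `(sector_t)-1` idiom yields the all-ones value of the narrower type, so the disk is sized to the largest addressable sector rather than to a wrapped, bogus value.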
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 10:44:34AM -0800, Shirley Ma wrote: On Wed, 2011-01-26 at 17:17 +0200, Michael S. Tsirkin wrote: I am seeing a similar problem, and am trying to fix that. My current theory is that this is a variant of a receive livelock: if the application isn't fast enough to process incoming data, the guest net stack switches from prequeue to backlog handling. One thing I noticed is that locking the vhost thread and the vcpu to the same physical CPU almost doubles the bandwidth. Can you confirm that in your setup? My current guess is that when we lock both to a single CPU, netperf in the guest gets scheduled, slowing down the vhost thread in the host. I also noticed that this specific workload performs better with vhost off: presumably we are loading the guest less. I found a similar issue for the small message size TCP_STREAM test with the guest as TX. I found that when I slow down TX, the BW performance is doubled for 1K to 4K message sizes. Shirley Interesting. In particular, running vhost and the transmitting guest on the same host would have the effect of slowing down TX. Does it double the BW for you too? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 11:09:00AM -0800, Shirley Ma wrote: On Thu, 2011-01-27 at 21:00 +0200, Michael S. Tsirkin wrote: Interesting. In particular, running vhost and the transmitting guest on the same host would have the effect of slowing down TX. Does it double the BW for you too? Running vhost and the TX guest on the same host seems not good enough to slow down TX. In order to gain double or even triple BW for guest TX to the local host I still need to play around; for 1K message size, BW is able to increase from 2.XGb/s to 6.XGb/s. Thanks Shirley Well, slowing down the guest does not sound hard - for example we can request guest notifications, or send extra interrupts :) A slightly more sophisticated thing to try is to poll the vq a bit more aggressively. For example, if we handled some requests and now the tx vq is empty, reschedule and yield. Worth a try? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 11:45:47AM -0800, Shirley Ma wrote: On Thu, 2011-01-27 at 21:31 +0200, Michael S. Tsirkin wrote: Well, slowing down the guest does not sound hard - for example we can request guest notifications, or send extra interrupts :) A slightly more sophisticated thing to try is to poll the vq a bit more aggressively. For example, if we handled some requests and now the tx vq is empty, reschedule and yield. Worth a try? I used dropping packets at a high level to slow down TX. I am still thinking about what's the right approach here. Interesting. Could this be a variant of the now famous bufferbloat then? I guess we could drop some packets if we see we are not keeping up. For example, if we see that the ring is X% full, we could quickly complete Y% without transmitting packets on. Or maybe we should drop some bytes, not packets. Requesting guest notification and extra interrupts is what we want to avoid to reduce VM exits for saving CPUs. I don't think it's good. Yes, but how do you explain the regression? One simple theory is that the guest net stack became faster and so the host can't keep up. By polling the vq a bit more aggressively, you meant vhost, right? Shirley Yes. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote: On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote: Interesting. Could this be a variant of the now famous bufferbloat then? Sigh, bufferbloat is the new global warming... :-/ Yep, some places become colder, some other places become warmer; same as BW results, sometimes faster, sometimes slower. :) Shirley

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value etc. Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000

+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;

 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n",
 				 err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;
+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced < tx_packets_coalesce ||
+		   bytes_coalesced < tx_bytes_coalesce ||
+		   bufs_coalesced < tx_bufs_coalesce))
+		vhost_signal(&net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
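The signalling policy in the patch above, stripped of the vhost machinery, is just a threshold test on three per-batch counters; a userspace model (`tx_should_signal` and `struct coalesce` are made-up names):

```c
#include <assert.h>
#include <stdbool.h>

/* Per-batch counters and their thresholds (the module parameters in the
 * patch).  With thresholds of 0 every packet signals, matching the old
 * behaviour; huge thresholds defer the interrupt to the end of the
 * batch, i.e. until the ring is seen empty. */
struct coalesce {
    int packets, bytes, bufs;             /* accumulated this batch */
    int max_packets, max_bytes, max_bufs; /* signal once any is exceeded */
};

/* Account one transmitted packet and report whether the guest should be
 * interrupted now rather than at batch end. */
static bool tx_should_signal(struct coalesce *c, int len, int nbufs)
{
    c->packets += 1;
    c->bytes += len;
    c->bufs += nbufs;
    return c->packets > c->max_packets ||
           c->bytes > c->max_bytes ||
           c->bufs > c->max_bufs;
}
```

This captures why the patch "by itself does nothing": with all thresholds at 0 the predicate is true for the very first packet, so the signal-per-packet behaviour is unchanged until the parameters are raised.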
[PATCHv2] vhost: force vhost off for non-MSI guests
When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd; for now, disable vhost-net in these configurations. Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Untested, I'll send a pull request later after some testing, assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1: added vhostforce to enable vhost for non-MSI guests

 hw/vhost.c      | 16 +++-
 hw/vhost.h      |  3 ++-
 hw/vhost_net.c  |  8 +---
 hw/vhost_net.h  |  2 +-
 hw/virtio-net.c |  6 --
 hw/virtio-pci.c |  6 +-
 hw/virtio.h     |  4 +++-
 net/tap.c       |  6 --
 qemu-options.hx |  4 +++-
 9 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 6082da2..72df75f 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -581,7 +581,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                               0, virtio_queue_get_desc_size(vdev, idx));
 }

-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force)
 {
     uint64_t features;
     int r;
@@ -613,6 +613,7 @@ int vhost_dev_init(struct vhost_dev *hdev, int devfd)
     hdev->log_enabled = false;
     hdev->started = false;
     cpu_register_phys_memory_client(&hdev->client);
+    hdev->force = force;
     return 0;
 fail:
     r = -errno;
@@ -636,9 +637,12 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         goto fail;
     }

-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true,
+                                           hdev->force);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }

@@ -686,7 +690,8 @@ fail_vq:
     }
 fail_mem:
 fail_features:
-    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                       hdev->force);
 fail_notifiers:
 fail:
     return r;
@@ -704,7 +709,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     }
     vhost_client_sync_dirty_bitmap(&hdev->client, 0,
                                    (target_phys_addr_t)~0x0ull);
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                           hdev->force);
     if (r < 0) {
         fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
         fflush(stderr);
diff --git a/hw/vhost.h b/hw/vhost.h
index 86dd834..d5d4d86 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -38,9 +38,10 @@ struct vhost_dev {
     bool log_enabled;
     vhost_log_chunk_t *log;
     unsigned long long log_size;
+    bool force;
 };

-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index c068be1..4f57271 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -81,7 +81,8 @@ static int vhost_net_get_fd(VLANClientState *backend)
     }
 }

-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -98,7 +99,7 @@ struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;

-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, force);
     if (r < 0) {
         goto fail;
     }
@@ -188,7 +189,8 @@ void vhost_net_cleanup(struct vhost_net *net)
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     return NULL;
 }
diff --git a/hw/vhost_net.h b/hw/vhost_net.h
index 6c18ff7..8435094 100644
--- a/hw/vhost_net.h
+++ b/hw/vhost_net.h
@@ -6,7 +6,7 @@ struct vhost_net;
 typedef struct vhost_net VHostNetState;

-VHostNetState *vhost_net_init(VLANClientState *backend, int devfd);
+VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, bool force
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Mon, Jan 31, 2011 at 02:47:34PM -0700, Alex Williamson wrote:
On Mon, 2011-01-31 at 23:19 +0200, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd; for now, disable vhost-net in these configurations. Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Untested, I'll send a pull request later after some testing, assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1: added vhostforce to enable vhost for non-MSI guests

I'm still not thrilled with the errno hack. If this is really a generic problem, we should better architect the set_guest_notifiers() callback; the force option feels like a band-aid. I'm still curious if we can't be more dynamic about how we set up the set_guest_notifiers() function pointer so it only gets set if (force || msix_enabled()), and we can limit some of the impact here to vhost. Thanks, Alex

I'm not entirely happy with the API either, but I'm sure playing with the function pointer is not the solution we are looking for: msix is enabled/disabled dynamically. force is static, but it is something the net backend knows about; virtio-pci doesn't. So I suspect modifying the callback on the fly will become messy quite quickly.

Well, this API will be easy to tweak after the fact; there's a single user after all.
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Mon, Jan 31, 2011 at 03:07:49PM -0700, Alex Williamson wrote:
On Tue, 2011-02-01 at 00:02 +0200, Michael S. Tsirkin wrote:
On Mon, Jan 31, 2011 at 02:47:34PM -0700, Alex Williamson wrote:
On Mon, 2011-01-31 at 23:19 +0200, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd; for now, disable vhost-net in these configurations. Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Untested, I'll send a pull request later after some testing, assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1: added vhostforce to enable vhost for non-MSI guests

I'm still not thrilled with the errno hack. If this is really a generic problem, we should better architect the set_guest_notifiers() callback; the force option feels like a band-aid. I'm still curious if we can't be more dynamic about how we set up the set_guest_notifiers() function pointer so it only gets set if (force || msix_enabled()), and we can limit some of the impact here to vhost. Thanks, Alex

I'm not entirely happy with the API either, but I'm sure playing with the function pointer is not the solution we are looking for: msix is enabled/disabled dynamically. force is static, but it is something the net backend knows about; virtio-pci doesn't. So I suspect modifying the callback on the fly will become messy quite quickly.

Isn't there a callback to the device when msix is enabled/disabled?

Not at the moment. In qemu we have mask/unmask.

Well, this API will be easy to tweak after the fact; there's a single user after all.

Gosh, I wish I got to take that attitude when I submit code...
;) Alex

You are probably dealing with more important issues :) If you have an internal API used in a single place, you get to change it at will. If you tweak an API used all over the codebase, you don't :)
Re: Network performance with small packets
On Mon, Jan 31, 2011 at 06:24:34PM -0600, Steve Dobbelstein wrote:
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value, etc. Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x8
+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;

 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

+		bufs_coalesced += out;

Correct.

Perusing the code I see that earlier there is a check to see if in is not zero and, if so, error out of the loop. After the check, in is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below. Or am I missing something?

+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(&net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }

Steve D.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Mon, Jan 31, 2011 at 05:30:38PM -0800, Sridhar Samudrala wrote:
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value, etc. Warning: completely untested.

@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

+		bufs_coalesced += out;

Perusing the code I see that earlier there is a check to see if in is not zero and, if so, error out of the loop. After the check, in is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below.

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for the small message guest-to-host stream test, and the throughput went up a little.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..5f9fae9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		vhost_add_used(vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
@@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
+	if (total_len > 0)
+		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }

Or am I missing something?

+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(&net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }

It is possible that we can miss signaling the guest even after processing a few pkts, if we don't hit any of these conditions.

Yes. It really should be:

	if (likely(packets_coalesced && bytes_coalesced && bufs_coalesced))
		vhost_signal(&net->dev, vq);

Steve D.
[PULL] vhost-net: 2.6.38 - warning fix
Please merge for 2.6.38. Not the most elegant fix, but it does fix the noise in dmesg that interferes with kvm debugging, and probably the best we can do for 2.6.38.

The following changes since commit fca540ab5f4718c6133f71f7be1793066008bf89:

  enc28j60: Fix reading of transmit status vector (2011-01-31 20:56:54 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net

Michael S. Tsirkin (1):
      vhost: rcu annotation fixup

 drivers/vhost/net.c   | 9 +
 drivers/vhost/vhost.h | 6 +++---
 2 files changed, 8 insertions(+), 7 deletions(-)
[PATCHv2 dontapply] vhost-net tx tuning
OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By default with it we will only interrupt when we see an empty ring. Which is likely too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value like half ring * 200, or packets some small value, etc. Set any one parameter to 0 to get current behaviour (interrupt immediately when enabled). Warning: completely untested.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x8
+int tx_bytes_coalesce = 10;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 10;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 10;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;

 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += out;
+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced &&
+		   bytes_coalesced &&
+		   bufs_coalesced))
+		vhost_signal(&net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }
[PATCHv3 0.14] vhost: force vhost off for non-MSI guests
When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd; for now, disable vhost-net in these configurations. Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

OK this is 0.14 material so quick review would be appreciated. This version's compiled only, I'll naturally test more before I queue it.

Changes from v2: get rid of EVIRTIO, add a separate API to query for fast guest notifiers
Changes from v1: add vhostforce flag

 hw/vhost.c      | 10 +-
 hw/vhost.h      |  4 +++-
 hw/vhost_net.c  | 24 ++--
 hw/vhost_net.h  |  3 ++-
 hw/virtio-net.c |  6 +-
 hw/virtio-pci.c |  7 +++
 hw/virtio.h     |  1 +
 net/tap.c       |  6 --
 qemu-options.hx |  4 +++-
 9 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 6082da2..38cc3b3 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -581,7 +581,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                               0, virtio_queue_get_desc_size(vdev, idx));
 }

-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force)
 {
     uint64_t features;
     int r;
@@ -613,6 +613,7 @@ int vhost_dev_init(struct vhost_dev *hdev, int devfd)
     hdev->log_enabled = false;
     hdev->started = false;
     cpu_register_phys_memory_client(&hdev->client);
+    hdev->force = force;
     return 0;
 fail:
     r = -errno;
@@ -627,6 +628,13 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
     close(hdev->control);
 }

+bool vhost_dev_query(struct vhost_dev *hdev, VirtIODevice *vdev)
+{
+    return !vdev->binding->query_guest_notifiers ||
+        vdev->binding->query_guest_notifiers(vdev->binding_opaque) ||
+        hdev->force;
+}
+
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
 {
     int i, r;
diff --git a/hw/vhost.h b/hw/vhost.h
index 86dd834..c8c595a 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -38,10 +38,12 @@ struct vhost_dev {
     bool log_enabled;
     vhost_log_chunk_t *log;
     unsigned long long log_size;
+    bool force;
 };

-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
+bool vhost_dev_query(struct vhost_dev *hdev, VirtIODevice *vdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index c068be1..420e05f 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -81,7 +81,8 @@ static int vhost_net_get_fd(VLANClientState *backend)
     }
 }

-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -98,7 +99,7 @@ struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;

-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, force);
     if (r < 0) {
         goto fail;
     }
@@ -121,6 +122,11 @@ fail:
     return NULL;
 }

+bool vhost_net_query(VHostNetState *net, VirtIODevice *dev)
+{
+    return vhost_dev_query(&net->dev, dev);
+}
+
 int vhost_net_start(struct vhost_net *net, VirtIODevice *dev)
 {
@@ -188,15 +194,21 @@ void vhost_net_cleanup(struct vhost_net *net)
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
+{
+    return NULL;
+}
+
+bool vhost_net_query(VHostNetState *net, VirtIODevice *dev)
 {
-	return NULL;
+    return false;
 }

 int vhost_net_start(struct vhost_net *net, VirtIODevice *dev)
 {
-	return -ENOSYS;
+    return -ENOSYS;
 }

 void vhost_net_stop(struct vhost_net *net, VirtIODevice *dev)
@@ -209,7 +221,7 @@ void vhost_net_cleanup(struct vhost_net *net)
 unsigned vhost_net_get_features(struct vhost_net *net, unsigned features)
 {
-	return features;
+    return features;
 }

 void vhost_net_ack_features(struct vhost_net *net, unsigned features)
 {
diff --git a/hw/vhost_net.h b/hw/vhost_net.h
index 6c18ff7..91e40b1 100644
--- a/hw/vhost_net.h
+++ b/hw/vhost_net.h
@@ -6,8 +6,9 @@ struct vhost_net;
 typedef struct vhost_net VHostNetState;

-VHostNetState *vhost_net_init(VLANClientState *backend, int devfd);
+VHostNetState *vhost_net_init(VLANClientState
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 12:25:08PM -0800, Shirley Ma wrote:
On Tue, 2011-02-01 at 22:17 +0200, Michael S. Tsirkin wrote:
On Tue, Feb 01, 2011 at 12:09:03PM -0800, Shirley Ma wrote:
On Tue, 2011-02-01 at 19:23 +0200, Michael S. Tsirkin wrote:
On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote:
On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote:

Interesting. Could this be a variant of the now famous bufferbloat then?

Sigh, bufferbloat is the new global warming... :-/

Yep, some places become colder, some other places become warmer; same as BW results, sometimes faster, sometimes slower. :)
Shirley

Sent a tuning patch (v2) that might help. Could you try it and play with the module parameters please?

Hello Michael,

Sure I will play with this patch to see how it could help. I am looking at the guest side as well; I found a couple of issues on the guest side:

1. free_old_xmit_skbs() should return the number of skbs instead of the total of sgs, since we are using ring size to stop/start the netif queue.

static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
{
	struct sk_buff *skb;
	unsigned int len, tot_sgs = 0;

	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
		pr_debug("Sent skb %p\n", skb);
		vi->dev->stats.tx_bytes += skb->len;
		vi->dev->stats.tx_packets++;
		tot_sgs += skb_vnet_hdr(skb)->num_sg;
		dev_kfree_skb_any(skb);
	}
	return tot_sgs;  /* should return number of skbs to track ring usage here, I think */
}

Did the old guest use number of buffers to track ring usage before?

2. In start_xmit, I think we should move capacity += free_old_xmit_skbs() before netif_stop_queue(), so we avoid unnecessary netif queue stop/start. This condition is heavily hit for small message sizes. Also, shouldn't the capacity checking condition change to something like half of the vring.num size, instead of comparing 2+MAX_SKB_FRAGS?

	if (capacity < 2+MAX_SKB_FRAGS) {
		netif_stop_queue(dev);
		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
			/* More just got used, free them then recheck. */
			capacity += free_old_xmit_skbs(vi);
			if (capacity >= 2+MAX_SKB_FRAGS) {
				netif_start_queue(dev);
				virtqueue_disable_cb(vi->svq);
			}
		}
	}

3. Looks like the xmit callback is only used to wake the queue when the queue has stopped, right? Should we put a condition check here?

static void skb_xmit_done(struct virtqueue *svq)
{
	struct virtnet_info *vi = svq->vdev->priv;

	/* Suppress further interrupts. */
	virtqueue_disable_cb(svq);

	/* We were probably waiting for more output buffers. */
--->	if (netif_queue_stopped(vi->dev))
		netif_wake_queue(vi->dev);
}

Shirley

Well, the return value is used to calculate capacity and that counts the # of s/g. No?

Nope, the current guest kernel uses descriptors, not number of sgs.

Confused. We compare capacity to skb frags, no? That's sg I think ... not sure about the old guest. From a cache utilization POV it might be better to read from the skb and not peek at the virtio header though...

Pls Cc the lists on any discussions in the future.

-- MST

Sorry I missed reply all. :(
Shirley
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 01:09:45PM -0800, Shirley Ma wrote:
On Mon, 2011-01-31 at 17:30 -0800, Sridhar Samudrala wrote:

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path. I tried a simpler version of this patch without any tunables by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for the small message guest-to-host stream test, and the throughput went up a little.

Reducing the signaling will reduce the CPU utilization by reducing VM exits. The small message BW is a problem we have seen with a faster guest / slow vhost; even when I increased VHOST_NET_WEIGHT several times, it didn't help that much for BW. For large message sizes, vhost is able to process all packets on time. I played around with guest/host code; so far I only see huge BW improvement by dropping packets on the guest side.

Thanks
Shirley

My theory is that the issue is not signalling. Rather, our queue fills up, then host handles one packet and sends an interrupt, and we immediately wake the queue. So the vq, once it gets full, stays full.

If you try my patch with the bufs threshold set to e.g. half the vq, what we will do is send an interrupt after we have processed half the vq. So the host has half the vq to go, and the guest has half the vq to fill. See?
-- MST
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:

Confused. We compare capacity to skb frags, no? That's sg I think ...

Current guest kernel uses indirect buffers; num_free returns how many available descriptors, not skb frags. So it's wrong here.
Shirley

I see. Good point. In other words, when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect, so we consume more descriptors. And then we start the queue and the add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect?

Another failure mode is when skb_xmit_done wakes the queue: it might be too early, there might not be space for the next packet in the vq yet.

A solution might be to keep some kind of pool around for indirect; we wanted to do it for block anyway ...

-- MST
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote: My theory is that the issue is not signalling. Rather, our queue fills up, then host handles one packet and sends an interrupt, and we immediately wake the queue. So the vq once it gets full, stays full. From the printk debugging output, it might not be exactly the case. The ring gets full, run a bit, then gets full, then run a bit, then full... Yes, but does it get even half empty in between? If you try my patch with bufs threshold set to e.g. half the vq, what we will do is send interrupt after we have processed half the vq. So host has half the vq to go, and guest has half the vq to fill. See? I am cleaning up my set up to run your patch ... Shirley
Re: [PATCHv2 dontapply] vhost-net tx tuning
On Tue, Feb 01, 2011 at 03:07:38PM -0800, Sridhar Samudrala wrote: On Tue, 2011-02-01 at 17:52 +0200, Michael S. Tsirkin wrote: OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By default with it we will only interrupt when we see an empty ring. Which is likely too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value like half ring * 200, or packets some small value, etc. Set any one parameter to 0 to get current behaviour (interrupt immediately when enabled). Warning: completely untested. Signed-off-by: Michael S. Tsirkin m...@redhat.com ---

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
+int tx_bytes_coalesce = 10;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 10;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 10;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;
 	/* TODO: check that we are running from vhost_worker?
 */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += out;
+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);

I think the counters that exceed the limits need to be reset to 0 here. Otherwise we keep signaling for every buffer once we hit this condition. Thanks Sridhar

Correct, good catch.

+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced + bytes_coalesced + bufs_coalesced))
+		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 02:59:57PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote: There are flags for bytes, buffers and packets. Try playing with any one of them :) Just be sure to use v2. I would like to change it to half of the ring size instead for signaling. Is that OK? Shirley Sure, that is why I made it a parameter so you can experiment. The initial test results show that CPU utilization has been reduced some, and BW has increased some with the default parameters: for 1K message size, BW goes from 2.5Gb/s to about 2.8Gb/s, and CPU utilization drops from 4x% to 38% (similar results to the patch I submitted a while ago to reduce signaling on vhost), but far away from the dropping-packet results. I am going to change the code to use 1/2 ring size to wake the netif queue. Shirley Just tweak the parameters with sysfs, you do not have to edit the code: echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce Or in a similar way for tx_packets_coalesce (since we use indirect, packets will typically use 1 buffer each). -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? Another failure mode is when skb_xmit_done wakes the queue: it might be too early, there might not be space for the next packet in the vq yet. I am not sure if this is the problem - shouldn't you see these messages:

if (likely(capacity == -ENOMEM)) {
	dev_warn(&dev->dev, "TX queue failure: out of memory\n");
} else {
	dev->stats.tx_fifo_errors++;
	dev_warn(&dev->dev, "Unexpected TX queue failure: %d\n", capacity);
}

in the next xmit? I am not getting this in my testing. Yes, I don't think we hit this in our testing, simply because we don't stress memory. Disable indirect, then you might see this. A solution might be to keep some kind of pool around for indirect, we wanted to do it for block anyway ... Your vhost patch should fix this automatically. Right? Reduce the chance of it happening, yes. Thanks, - KK
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote: The way I am changing is only when netif queue has stopped, then we start to count num_free descriptors to send the signal to wake netif queue. I forgot to mention, the code change I am making is in the guest kernel: in the xmit callback, only wake up the queue when it's stopped and num_free >= 1/2 * vq->num. I add a new API in virtio_ring. Interesting. Yes, I agree an API extension would be helpful. However, wouldn't just the signaling reduction be enough, without guest changes? However vhost signaling reduction is needed as well. The patch I submitted a while ago showed both CPU and BW improvement. Thanks Shirley Which patch was that? -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote: Yes, I think doing this in the host is much simpler, just send an interrupt after there's a decent amount of space in the queue. Having said that, the simple heuristic that I coded might be a bit too simple. From the debugging output what I have seen so far (a single small message TCP_STREAM test), I think the right approach is to patch both guest and vhost. One problem is slowing down the guest helps here. So there's a chance that just by adding complexity in guest driver we get a small improvement :( We can't rely on a patched guest anyway, so I think it is best to test guest and host changes separately. And I do agree something needs to be done in guest too, for example when vqs share an interrupt, we might invoke a callback when we see vq is not empty even though it's not requested. Probably should check interrupts enabled here? The problem I have found is a regression for the single small message TCP_STREAM test. The old kernel works well for TCP_STREAM; only the new kernel has a problem. Likely new kernel is faster :) For Steven's problem, it's multiple stream TCP_RR issues; the old guest doesn't perform well, and neither does the new guest kernel. We tested the patch reducing vhost signaling before; it didn't help the performance at all. Thanks Shirley Yes, it seems unrelated to tx interrupts. -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote: On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote: w/i guest change, I played around with the parameters, for example: I could get 3.7Gb/s with 42% CPU (BW increasing from 2.5Gb/s) for 1K message size; w/i dropping packets, I was able to get up to 6.2Gb/s with similar CPU usage. I meant w/o guest change, only vhost changes. Sorry about that. Shirley Ah, excellent. What were the parameters? I used half of the ring size, 129, for the packet counter, but the performance is still not as good as dropping packets on the guest: 3.7Gb/s vs. 6.2Gb/s. Shirley And this is with sndbuf=0 in the host, yes? And do you see a lot of tx interrupts? How many packets per interrupt? -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 09:10:35AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote: Yes, I think doing this in the host is much simpler, just send an interrupt after there's a decent amount of space in the queue. Having said that, the simple heuristic that I coded might be a bit too simple. From the debugging output what I have seen so far (a single small message TCP_STREAM test), I think the right approach is to patch both guest and vhost. One problem is slowing down the guest helps here. So there's a chance that just by adding complexity in guest driver we get a small improvement :( We can't rely on a patched guest anyway, so I think it is best to test guest and host changes separately. And I do agree something needs to be done in guest too, for example when vqs share an interrupt, we might invoke a callback when we see vq is not empty even though it's not requested. Probably should check interrupts enabled here? Yes, I modified the xmit callback to something like below:

static void skb_xmit_done(struct virtqueue *svq)
{
	struct virtnet_info *vi = svq->vdev->priv;

	/* Suppress further interrupts. */
	virtqueue_disable_cb(svq);

	/* We were probably waiting for more output buffers. */
	if (netif_queue_stopped(vi->dev)) {
		free_old_xmit_skbs(vi);
		if (virtqueue_free_size(svq) <= svq->vring.num / 2) {
			virtqueue_enable_cb(svq);
			return;
		}
	}
	netif_wake_queue(vi->dev);
}

OK, but this should have no effect with a vhost patch, which should ensure that we don't get an interrupt until the queue is at least half empty. Right? The problem I have found is a regression for the single small message TCP_STREAM test. The old kernel works well for TCP_STREAM; only the new kernel has a problem. Likely new kernel is faster :) For Steven's problem, it's multiple stream TCP_RR issues; the old guest doesn't perform well, and neither does the new guest kernel.
We tested reducing vhost signaling patch before, it didn't help the performance at all. Thanks Shirley Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. Could be. Why do you think so? Do you have anything in mind on how to reduce vhost latency? Thanks Shirley Hmm, bypassing the bridge might help a bit. Are you using tap+bridge or macvtap?
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote: On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote: w/i guest change, I played around the parameters,for example: I could get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message size, w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU usage. I meant w/o guest change, only vhost changes. Sorry about that. Shirley Ah, excellent. What were the parameters? I used half of the ring size 129 for packet counters, but the performance is still not as good as dropping packets on guest, 3.7 Gb/s vs. 6.2Gb/s. Shirley How many packets and bytes per interrupt are sent? Also, what about other values for the counters and other counters? What does your patch do? Just drop packets instead of stopping the interface? To have an understanding when should we drop packets in the guest, we need to know *why* does it help. Otherwise, how do we know it will work for others? Note that qdisc will drop packets when it overruns - so what is different? Also, are we over-running some other queue somewhere? -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote: OK, but this should have no effect with a vhost patch which should ensure that we don't get an interrupt until the queue is at least half empty. Right? There should be some coordination between guest and vhost. What kind of coordination? With a patched vhost and a full ring, you should get an interrupt per 100 packets. Is this what you see? And if yes, isn't the guest patch doing nothing then? We shouldn't count the TX packets when the netif queue is enabled, since the next guest TX xmit will free any used buffers in vhost. We need to be careful here in case we miss the interrupts when the netif queue has stopped. However, we can't change old guests, so we can test the patches separately for guest only, vhost only, and the combination. Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. Could be. Why do you think so? Since I played with a latency hack, I can see performance differences for different latencies. Which hack was that? Do you have anything in mind on how to reduce vhost latency? Thanks Shirley Hmm, bypassing the bridge might help a bit. Are you using tap+bridge or macvtap? I am using tap+bridge for the TCP_RR test. I think Steven tested macvtap before; he might have some data from his workload performance measurement. Shirley
Re: Network performance with small packets
On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote: I am working on a KVM network performance issue found in our lab running the DayTrader benchmark. The benchmark throughput takes a significant hit when running the application server in a KVM guest versus on bare metal. We have dug into the problem and found that DayTrader's use of small packets exposes KVM's overhead of handling network packets. I have been able to reproduce the performance hit with a simpler setup using the netperf benchmark with the TCP_RR test and the request and response sizes set to 256 bytes. I run the benchmark between two physical systems, each using a 1Gb link. In order to get the maximum throughput for the system I have to run 100 instances of netperf. When I run the netserver processes in a guest, I see a maximum throughput that is 51% of what I get if I run the netserver processes directly on the host. The CPU utilization in the guest is only 85% at maximum throughput, whereas it is 100% on bare metal. You are stressing the scheduler pretty hard with this test :) Is your real benchmark also using a huge number of threads? If it's not, you might be seeing a different issue. IOW, the netperf degradation might not be network-related at all, but have to do with the speed of context switching in the guest. Thoughts? The KVM host has 16 CPUs. The KVM guest is configured with 2 VCPUs. When I run netperf on the host I boot the host with maxcpus=2 on the kernel command line. The host is running the current KVM upstream kernel along with the current upstream qemu.
Here is the qemu command used to launch the guest: /build/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 -name glasgow-RH60 -m 32768 -drive file=/build/guest-data/glasgow-RH60.img,if=virtio,index=0,boot=on -drive file=/dev/virt/WAS,if=virtio,index=1 -net nic,model=virtio,vlan=3,macaddr=00:1A:64:E5:00:63,netdev=nic0 -netdev tap,id=nic0,vhost=on -smp 2 -vnc :1 -monitor telnet::4499,server,nowait -serial telnet::8899,server,nowait --mem-path /libhugetlbfs -daemonize We have tried various proposed fixes, each with varying amounts of success. One such fix was to add code to the vhost thread such that when it found the work queue empty it wouldn't just exit the thread but rather would delay for 50 microseconds and then recheck the queue. If there was work on the queue it would loop back and process it, else it would exit the thread. The change got us a 13% improvement in the DayTrader throughput. Running the same netperf configuration on the same hardware but using a different hypervisor gets us significantly better throughput numbers. The guest on that hypervisor runs at 100% CPU utilization. The various fixes we have tried have not gotten us close to the throughput seen on the other hypervisor. I'm looking for ideas/input from the KVM experts on how to make KVM perform better when handling small packets. Thanks, Steve
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 11:29:35AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote: OK, but this should have no effect with a vhost patch which should ensure that we don't get an interrupt until the queue is at least half empty. Right? There should be some coordination between guest and vhost. What kind of coordination? With a patched vhost and a full ring, you should get an interrupt per 100 packets. Is this what you see? And if yes, isn't the guest patch doing nothing then? vhost_signal won't be able to send any TX interrupts to the guest when the guest TX interrupt is disabled. The guest TX interrupt is only enabled when running out of descriptors. Well, this is also the only case where the queue is stopped, no? We shouldn't count the TX packets when the netif queue is enabled, since the next guest TX xmit will free any used buffers in vhost. We need to be careful here in case we miss the interrupts when the netif queue has stopped. However, we can't change old guests, so we can test the patches separately for guest only, vhost only, and the combination. Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. Could be. Why do you think so? Since I played with a latency hack, I can see performance differences for different latencies. Which hack was that? I tried to accumulate multiple guest to host notifications for TX xmits; it did help multiple-stream TCP_RR results. I don't see a point to delay the used idx update, do you? So delaying just the signal seems better, right? I also forced vhost handle_tx to handle more packets; both hacks seemed to help. Thanks Shirley Haven't noticed that part, how does your patch make it handle more packets?
-- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 01:03:05PM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote: Well, this is also the only case where the queue is stopped, no? Yes. I got some debugging data; I saw that sometimes there were so many packets waiting to be freed in the guest between vhost_signal and the guest xmit callback. What does this mean? Looks like the time spent from vhost_signal to the guest xmit callback is too long? I tried to accumulate multiple guest to host notifications for TX xmits; it did help multiple-stream TCP_RR results. I don't see a point to delay the used idx update, do you? It might cause per vhost handle_tx processed more packets. I don't understand. It's a couple of writes - what is the issue? So delaying just the signal seems better, right? I think I need to define the test matrix to collect data for TX xmit from guest to host here for different tests.

Data to be collected:
1. kvm_stat for VM, I/O exits
2. cpu utilization for both guest and host
3. cat /proc/interrupts on guest
4. packet rate from vhost handle_tx per loop
5. guest netif queue stop rate
6. how many packets are waiting for free between vhost signaling and guest callback
7. performance results

Tests:
1. TCP_STREAM single stream test for 1K to 4K message size
2. TCP_RR (64 instance test): 128 - 1K request/response size

Different hacks:
1. Base line data (with the patch to fix the capacity check first; free_old_xmit_skbs returns number of skbs)
2. Drop packet data (will put some debugging in generic networking code)
3. Delay guest netif queue wake up until certain descriptors (1/2 ring size, 1/4 ring size...) are available once the queue has stopped.
4. Accumulate more packets per vhost signal in handle_tx?
5. 3 & 4 combinations
6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer?
7. Accumulate more packets per vhost handle_tx() by adding some delay?

Haven't noticed that part, how does your patch make it handle more packets? Added a delay in handle_tx().
What else? It would take some time to do this. Shirley Need to think about this.
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 01:41:33PM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote: On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote: Well, this is also the only case where the queue is stopped, no? Yes. I got some debugging data; I saw that sometimes there were so many packets waiting to be freed in the guest between vhost_signal and the guest xmit callback. What does this mean? Let's look at the sequence here:

guest start_xmit()
	xmit_skb()
	if ring is full, enable_cb()

guest skb_xmit_done()
	disable_cb,
	printk free_old_xmit_skbs -- it was between more than 1/2 to full ring size
	printk vq->num_free

vhost handle_tx()
	if (guest interrupt is enabled) signal guest to free xmit buffers

So between the guest queue getting full/stopping the queue/enabling the callback and the guest receiving the callback from the host to free_old_xmit_skbs, there were about 1/2 to the full ring size of descriptors available. I thought there were only a few. (I disabled your vhost patch for this test.) The expected number is vq->num - max skb frags - 2. Looks like the time spent from vhost_signal to the guest xmit callback is too long? I tried to accumulate multiple guest to host notifications for TX xmits; it did help multiple-stream TCP_RR results. I don't see a point to delay the used idx update, do you? It might cause per vhost handle_tx processed more packets. I don't understand. It's a couple of writes - what is the issue? Oh, handle_tx could process more packets per loop for the multiple-stream TCP_RR case. I need to print out the data rate per loop to confirm this. Shirley
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 09:05:56PM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote: I think I need to define the test matrix to collect data for TX xmit from guest to host here for different tests. Data to be collected: 1. kvm_stat for VM, I/O exits 2. cpu utilization for both guest and host 3. cat /proc/interrupts on guest 4. packet rate from vhost handle_tx per loop 5. guest netif queue stop rate 6. how many packets are waiting for free between vhost signaling and guest callback 7. performance results Tests: 1. TCP_STREAM single stream test for 1K to 4K message size 2. TCP_RR (64 instance test): 128 - 1K request/response size Different hacks: 1. Base line data (with the patch to fix the capacity check first; free_old_xmit_skbs returns number of skbs) 2. Drop packet data (will put some debugging in generic networking code) Since I found that the netif queue stop/wake up is so expensive, I created a packet-dropping patch on the guest side so I don't need to debug generic networking code:

guest start_xmit()
	capacity = free_old_xmit_skbs() + virtqueue_get_num_freed()
	if (capacity == 0) {
		drop this packet;
		return;
	}

In the patch, both guest TX interrupts and the callback have been omitted. Host vhost_signal in handle_tx can be removed entirely as well. (A new virtio_ring API is needed to export the total number of free descriptors here -- virtqueue_get_num_freed.) Initial TCP_STREAM performance results I got for guest to local host: 4.2Gb/s for 1K message size (vs. 2.5Gb/s), 6.2Gb/s for 2K message size (vs. 3.8Gb/s), and 9.8Gb/s for 4K message size (vs. 5.xGb/s). What is the average packet size, # bytes per ack, and the # of interrupts per packet? It could be that just slowing down transmission makes GSO work better. Since the large message size (64K) doesn't hit the (capacity == 0) case, its performance is only a little better (from 13.xGb/s to 14.xGb/s). kvm_stat output shows a significant exit reduction for both VM and I/O, and no guest TX interrupts.
With dropping packets, TCP retrans has increased here, so the performance numbers vary. This might not be a good solution, but it gave us some idea of how expensive the netif queue stop/wake up between guest and host notification is. I couldn't find a better solution for reducing the netif queue stop/wake up rate for small message sizes. But I think once we can address this, the guest TX performance will burst for small message sizes. I also compared this with the approach of returning TX_BUSY when (capacity == 0); it is not as good as dropping packets. 3. Delay guest netif queue wake up until certain descriptors (1/2 ring size, 1/4 ring size...) are available once the queue has stopped. 4. Accumulate more packets per vhost signal in handle_tx? 5. 3 & 4 combinations 6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer? 7. Accumulate more packets per vhost handle_tx() by adding some delay? Haven't noticed that part, how does your patch make it handle more packets? Added a delay in handle_tx(). What else? It would take some time to do this. Shirley Need to think about this.
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 10:09:14PM -0800, Shirley Ma wrote: On Thu, 2011-02-03 at 07:59 +0200, Michael S. Tsirkin wrote: Let's look at the sequence here:

guest start_xmit()
	xmit_skb()
	if ring is full, enable_cb()

guest skb_xmit_done()
	disable_cb,
	printk free_old_xmit_skbs -- it was between more than 1/2 to full ring size
	printk vq->num_free

vhost handle_tx()
	if (guest interrupt is enabled) signal guest to free xmit buffers

So between the guest queue getting full/stopping the queue/enabling the callback and the guest receiving the callback from the host to free_old_xmit_skbs, there were about 1/2 to the full ring size of descriptors available. I thought there were only a few. (I disabled your vhost patch for this test.) The expected number is vq->num - max skb frags - 2. It varied (up to the ring size, 256). This is using indirect buffers; it returned how many descriptors were freed, not the number of buffers. Why do you think it is vq->num - max skb frags - 2 here? Shirley Well, the queue is stopped, which happens when

if (capacity < 2+MAX_SKB_FRAGS) {
	netif_stop_queue(dev);
	if (unlikely(!virtqueue_enable_cb(vi->svq))) {
		/* More just got used, free them then recheck. */
		capacity += free_old_xmit_skbs(vi);
		if (capacity >= 2+MAX_SKB_FRAGS) {
			netif_start_queue(dev);
			virtqueue_disable_cb(vi->svq);
		}
	}
}

This should be the most common case. I guess the case with += free_old_xmit_skbs is what can get us more. But it should be rare. Can you count how common it is? -- MST
Re: Network performance with small packets
On Thu, Feb 03, 2011 at 07:58:00AM -0800, Shirley Ma wrote: On Thu, 2011-02-03 at 08:13 +0200, Michael S. Tsirkin wrote: Initial TCP_STREAM performance results I got for guest to local host: 4.2Gb/s for 1K message size (vs. 2.5Gb/s), 6.2Gb/s for 2K message size (vs. 3.8Gb/s), and 9.8Gb/s for 4K message size (vs. 5.xGb/s). What is the average packet size, # bytes per ack, and the # of interrupts per packet? It could be that just slowing down transmission makes GSO work better. There are no TX interrupts with dropping packets. GSO/TSO is the key for small message performance; w/o GSO/TSO, the performance is limited to about 2Gb/s no matter how big the message size is. I think any work we try here will increase the large packet rate. BTW, with dropping packets, TCP increased fast retrans, not slow start. I will collect tcpdump and netstat data before and after to compare packet size/rate w/o and w/i the patch. Thanks Shirley Just a thought: does it help to make the tx queue len of the virtio device smaller? E.g. match the vq size? -- MST
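One way to try MST's suggestion from inside the guest (a config sketch; `eth0` and the 256-entry ring size are assumptions for illustration, not from the thread):

```shell
# Shrink the qdisc transmit queue to roughly match the vq size
# (eth0 and 256 are illustrative assumptions).
ip link set dev eth0 txqueuelen 256
# Verify the new length:
ip link show dev eth0 | grep -o 'qlen [0-9]*'
```

With a shorter qdisc queue, overflow shows up as qdisc drops instead of deep queueing, which is closer to the drop-on-full behaviour being tested.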
Re: [PATCH 03/13] AMD IOMMU emulation
On Fri, Feb 04, 2011 at 01:32:57AM +0200, Eduard - Gabriel Munteanu wrote: This introduces emulation for the AMD IOMMU, described in AMD I/O Virtualization Technology (IOMMU) Specification. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro --- Makefile.target |2 +- hw/amd_iommu.c | 694 +++ hw/pc.c |2 + hw/pci_ids.h|2 + hw/pci_regs.h |1 + 5 files changed, 700 insertions(+), 1 deletions(-) create mode 100644 hw/amd_iommu.c diff --git a/Makefile.target b/Makefile.target index e5817ab..4b650bd 100644 --- a/Makefile.target +++ b/Makefile.target @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += debugcon.o multiboot.o -obj-i386-y += pc_piix.o dma_rw.o +obj-i386-y += pc_piix.o dma_rw.o amd_iommu.o obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o # shared objects diff --git a/hw/amd_iommu.c b/hw/amd_iommu.c new file mode 100644 index 000..6c6346a --- /dev/null +++ b/hw/amd_iommu.c @@ -0,0 +1,694 @@ +/* + * AMD IOMMU emulation + * + * Copyright (c) 2011 Eduard - Gabriel Munteanu + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include "pc.h" +#include "hw.h" +#include "pci.h" +#include "qlist.h" +#include "dma_rw.h" + +/* Capability registers */ +#define CAPAB_HEADER 0x00 +#define CAPAB_REV_TYPE 0x02 +#define CAPAB_FLAGS 0x03 +#define CAPAB_BAR_LOW 0x04 +#define CAPAB_BAR_HIGH 0x08 +#define CAPAB_RANGE 0x0C +#define CAPAB_MISC 0x10 + +#define CAPAB_SIZE 0x14 +#define CAPAB_REG_SIZE 0x04 + +/* Capability header data */ +#define CAPAB_FLAG_IOTLBSUP (1 << 0) +#define CAPAB_FLAG_HTTUNNEL (1 << 1) +#define CAPAB_FLAG_NPCACHE (1 << 2) +#define CAPAB_INIT_REV (1 << 3) +#define CAPAB_INIT_TYPE 3 +#define CAPAB_INIT_REV_TYPE (CAPAB_REV | CAPAB_TYPE) +#define CAPAB_INIT_FLAGS (CAPAB_FLAG_NPCACHE | CAPAB_FLAG_HTTUNNEL) +#define CAPAB_INIT_MISC ((64 << 15) | (48 << 8)) +#define CAPAB_BAR_MASK ~((1UL << 14) - 1) + +/* MMIO registers */ +#define MMIO_DEVICE_TABLE 0x0000 +#define MMIO_COMMAND_BASE 0x0008 +#define MMIO_EVENT_BASE 0x0010 +#define MMIO_CONTROL 0x0018 +#define MMIO_EXCL_BASE 0x0020 +#define MMIO_EXCL_LIMIT 0x0028 +#define MMIO_COMMAND_HEAD 0x2000 +#define MMIO_COMMAND_TAIL 0x2008 +#define MMIO_EVENT_HEAD 0x2010 +#define MMIO_EVENT_TAIL 0x2018 +#define MMIO_STATUS 0x2020 + +#define MMIO_SIZE 0x4000 + +#define MMIO_DEVTAB_SIZE_MASK ((1ULL << 12) - 1) +#define MMIO_DEVTAB_BASE_MASK (((1ULL << 52) - 1) & ~MMIO_DEVTAB_SIZE_MASK) +#define MMIO_DEVTAB_ENTRY_SIZE 32 +#define MMIO_DEVTAB_SIZE_UNIT 4096 + +#define MMIO_CMDBUF_SIZE_BYTE (MMIO_COMMAND_BASE + 7) +#define MMIO_CMDBUF_SIZE_MASK 0x0F +#define MMIO_CMDBUF_BASE_MASK MMIO_DEVTAB_BASE_MASK +#define MMIO_CMDBUF_DEFAULT_SIZE 8 +#define MMIO_CMDBUF_HEAD_MASK (((1ULL << 19) - 1) & ~0x0F) +#define MMIO_CMDBUF_TAIL_MASK MMIO_EVTLOG_HEAD_MASK + +#define MMIO_EVTLOG_SIZE_BYTE (MMIO_EVENT_BASE + 7) +#define MMIO_EVTLOG_SIZE_MASK MMIO_CMDBUF_SIZE_MASK +#define MMIO_EVTLOG_BASE_MASK MMIO_CMDBUF_BASE_MASK +#define MMIO_EVTLOG_DEFAULT_SIZE MMIO_CMDBUF_DEFAULT_SIZE +#define MMIO_EVTLOG_HEAD_MASK (((1ULL << 19) - 1) & ~0x0F) +#define MMIO_EVTLOG_TAIL_MASK MMIO_EVTLOG_HEAD_MASK + +#define MMIO_EXCL_BASE_MASK MMIO_DEVTAB_BASE_MASK +#define MMIO_EXCL_ENABLED_MASK (1ULL << 0) +#define MMIO_EXCL_ALLOW_MASK (1ULL << 1) +#define
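As a quick sanity check on the mask arithmetic in the quoted defines, the sketch below decodes a hypothetical device-table register value. Only the two DEVTAB masks and the size-unit/entry-size constants are taken from the patch; the helper functions are invented here, and the "(number of 4K units) - 1" size encoding follows the AMD IOMMU specification's device table base address register, so treat this as illustrative.

```c
/* Sketch of how a device-table register would be decoded with the
 * DEVTAB masks from the quoted patch: the size field lives in the
 * low 12 bits, the base address in bits 12..51.  The helpers are
 * invented for illustration, not part of the patch. */
#include <assert.h>
#include <stdint.h>

#define MMIO_DEVTAB_SIZE_MASK ((1ULL << 12) - 1)
#define MMIO_DEVTAB_BASE_MASK (((1ULL << 52) - 1) & ~MMIO_DEVTAB_SIZE_MASK)
#define MMIO_DEVTAB_SIZE_UNIT 4096

static uint64_t devtab_base(uint64_t reg)
{
    return reg & MMIO_DEVTAB_BASE_MASK;   /* 4K-aligned base, bits 12..51 */
}

/* Size field encodes (number of 4K units) - 1 per the AMD spec. */
static uint64_t devtab_len(uint64_t reg)
{
    return ((reg & MMIO_DEVTAB_SIZE_MASK) + 1) * MMIO_DEVTAB_SIZE_UNIT;
}
```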
Re: [PATCH 01/13] Generic DMA memory access interface
On Fri, Feb 04, 2011 at 01:32:55AM +0200, Eduard - Gabriel Munteanu wrote: This introduces replacements for memory access functions like cpu_physical_memory_read(). The new interface can handle address translation and access checking through an IOMMU. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro --- Makefile.target |2 +- hw/dma_rw.c | 124 +++ hw/dma_rw.h | 157 +++ 3 files changed, 282 insertions(+), 1 deletions(-) create mode 100644 hw/dma_rw.c create mode 100644 hw/dma_rw.h diff --git a/Makefile.target b/Makefile.target index e15b1c4..e5817ab 100644 --- a/Makefile.target +++ b/Makefile.target @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += debugcon.o multiboot.o -obj-i386-y += pc_piix.o +obj-i386-y += pc_piix.o dma_rw.o obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o # shared objects diff --git a/hw/dma_rw.c b/hw/dma_rw.c new file mode 100644 index 000..ef8e7f8 --- /dev/null +++ b/hw/dma_rw.c @@ -0,0 +1,124 @@ +/* + * Generic DMA memory access interface. + * + * Copyright (c) 2011 Eduard - Gabriel Munteanu + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include "dma_rw.h" +#include "range.h" + +static void dma_register_memory_map(DMADevice *dev, +dma_addr_t addr, +dma_addr_t len, +target_phys_addr_t paddr, +DMAInvalidateMapFunc *invalidate, +void *invalidate_opaque) +{ +DMAMemoryMap *map; + +map = qemu_malloc(sizeof(DMAMemoryMap)); +map->addr = addr; +map->len = len; +map->paddr = paddr; +map->invalidate = invalidate; +map->invalidate_opaque = invalidate_opaque; + +QLIST_INSERT_HEAD(&dev->mmu->memory_maps, map, list); +} + +static void dma_unregister_memory_map(DMADevice *dev, + target_phys_addr_t paddr, + dma_addr_t len) +{ +DMAMemoryMap *map; + +QLIST_FOREACH(map, &dev->mmu->memory_maps, list) { +if (map->paddr == paddr && map->len == len) { +QLIST_REMOVE(map, list); +free(map); +} +} +} + +void dma_invalidate_memory_range(DMADevice *dev, + dma_addr_t addr, + dma_addr_t len) +{ +DMAMemoryMap *map; + +QLIST_FOREACH(map, &dev->mmu->memory_maps, list) { +if (ranges_overlap(addr, len, map->addr, map->len)) { +map->invalidate(map->invalidate_opaque); +QLIST_REMOVE(map, list); +free(map); +} +} +} + +void *dma_memory_map(DMADevice *dev, + DMAInvalidateMapFunc *cb, + void *opaque, + dma_addr_t addr, + dma_addr_t *len, + int is_write) +{ +int err; +target_phys_addr_t paddr, plen; + +if (!dev || !dev->mmu) { +return cpu_physical_memory_map(addr, len, is_write); +} + +plen = *len; +err = dev->mmu->translate(dev, addr, &paddr, &plen, is_write); +if (err) { +return NULL; +} + +/* + * If this is true, the virtual region is contiguous, + * but the translated physical region isn't. We just + * clamp *len, much like cpu_physical_memory_map() does. + */ +if (plen < *len) { +*len = plen; +} + +/* We treat maps as remote TLBs to cope with stuff like AIO. */ +if (cb) {
Re: [PATCH 04/13] ide: use the DMA memory access interface for PCI IDE controllers
On Fri, Feb 04, 2011 at 01:32:58AM +0200, Eduard - Gabriel Munteanu wrote: Emulated PCI IDE controllers now use the memory access interface. This also allows an emulated IOMMU to translate and check accesses. Map invalidation results in cancelling DMA transfers. Since the guest OS can't properly recover the DMA results in case the mapping is changed, this is a fairly good approximation. Note this doesn't handle AHCI emulation yet! Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro How about not changing ahci then, and failing initialization if an mmu is present? --- dma-helpers.c | 23 ++- dma.h | 4 +++- hw/ide/ahci.c | 3 ++- hw/ide/internal.h | 1 + hw/ide/macio.c | 4 ++-- hw/ide/pci.c | 18 +++--- 6 files changed, 37 insertions(+), 16 deletions(-) diff --git a/dma-helpers.c b/dma-helpers.c index 712ed89..29a74a4 100644 --- a/dma-helpers.c +++ b/dma-helpers.c @@ -10,12 +10,13 @@ #include "dma.h" #include "block_int.h" -void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint) +void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint, DMADevice *dma) { qsg->sg = qemu_malloc(alloc_hint * sizeof(ScatterGatherEntry)); qsg->nsg = 0; qsg->nalloc = alloc_hint; qsg->size = 0; +qsg->dma = dma; } void qemu_sglist_add(QEMUSGList *qsg, target_phys_addr_t base, @@ -73,12 +73,23 @@ static void dma_bdrv_unmap(DMAAIOCB *dbs) int i; for (i = 0; i < dbs->iov.niov; ++i) { -cpu_physical_memory_unmap(dbs->iov.iov[i].iov_base, - dbs->iov.iov[i].iov_len, !dbs->is_write, - dbs->iov.iov[i].iov_len); +dma_memory_unmap(dbs->sg->dma, + dbs->iov.iov[i].iov_base, + dbs->iov.iov[i].iov_len, !dbs->is_write, + dbs->iov.iov[i].iov_len); } } +static void dma_bdrv_cancel(void *opaque) +{ +DMAAIOCB *dbs = opaque; + +bdrv_aio_cancel(dbs->acb); +dma_bdrv_unmap(dbs); +qemu_iovec_destroy(&dbs->iov); +qemu_aio_release(dbs); +} + static void dma_bdrv_cb(void *opaque, int ret) { DMAAIOCB *dbs = (DMAAIOCB *)opaque; @@ -100,7 +112,8 @@ static void dma_bdrv_cb(void *opaque, int ret) while (dbs->sg_cur_index < dbs->sg->nsg) { cur_addr = dbs->sg->sg[dbs->sg_cur_index].base + dbs->sg_cur_byte; cur_len = dbs->sg->sg[dbs->sg_cur_index].len - dbs->sg_cur_byte; -mem = cpu_physical_memory_map(cur_addr, &cur_len, !dbs->is_write); +mem = dma_memory_map(dbs->sg->dma, dma_bdrv_cancel, dbs, + cur_addr, &cur_len, !dbs->is_write); if (!mem) break; qemu_iovec_add(&dbs->iov, mem, cur_len); diff --git a/dma.h b/dma.h index f3bb275..2417b32 100644 --- a/dma.h +++ b/dma.h @@ -14,6 +14,7 @@ //#include "cpu.h" #include "hw/hw.h" #include "block.h" +#include "hw/dma_rw.h" typedef struct { target_phys_addr_t base; @@ -25,9 +26,10 @@ typedef struct { int nsg; int nalloc; target_phys_addr_t size; +DMADevice *dma; } QEMUSGList; -void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint); +void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint, DMADevice *dma); void qemu_sglist_add(QEMUSGList *qsg, target_phys_addr_t base, target_phys_addr_t len); void qemu_sglist_destroy(QEMUSGList *qsg); diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c index 968fdce..aea06a9 100644 --- a/hw/ide/ahci.c +++ b/hw/ide/ahci.c @@ -993,7 +993,8 @@ static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist) if (sglist_alloc_hint > 0) { AHCI_SG *tbl = (AHCI_SG *)prdt; -qemu_sglist_init(sglist, sglist_alloc_hint); +/* FIXME: pass a proper DMADevice. */ +qemu_sglist_init(sglist, sglist_alloc_hint, NULL); for (i = 0; i < sglist_alloc_hint; i++) { /* flags_size is zero-based */ qemu_sglist_add(sglist, le64_to_cpu(tbl[i].addr), diff --git a/hw/ide/internal.h b/hw/ide/internal.h index 697c3b4..3d3d5db 100644 --- a/hw/ide/internal.h +++ b/hw/ide/internal.h @@ -468,6 +468,7 @@ struct IDEDMA { struct iovec iov; QEMUIOVector qiov; BlockDriverAIOCB *aiocb; +DMADevice *dev; }; struct IDEBus { diff --git a/hw/ide/macio.c b/hw/ide/macio.c index c1b4caa..654ae7c 100644 --- a/hw/ide/macio.c +++ b/hw/ide/macio.c @@ -79,7 +79,7 @@ static void pmac_ide_atapi_transfer_cb(void *opaque, int ret) s->io_buffer_size = io->len; -qemu_sglist_init(&s->sg, io->len / MACIO_PAGE_SIZE + 1); +qemu_sglist_init(&s->sg, io->len / MACIO_PAGE_SIZE + 1, NULL); qemu_sglist_add(&s->sg, io->addr, io->len); io->addr += io->len; io->len = 0; @@ -141,7 +141,7 @@ static void pmac_ide_transfer_cb(void *opaque, int ret) s->io_buffer_index = 0; s->io_buffer_size =
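The partial-mapping behaviour that the quoted dma_bdrv_cb() loop relies on — resuming mid-entry when dma_memory_map() clamps the mapped length — can be modelled in isolation. This sketch is not the QEMU code: the walk struct and walk_advance() are invented, and error paths are omitted; it only mirrors the sg_cur_index/sg_cur_byte bookkeeping in the quoted loop, assuming each mapping covers at most the remainder of the current entry.

```c
/* Model of the scatter-gather walk in the quoted dma_bdrv_cb():
 * sg_cur_index/sg_cur_byte track progress through the list, and a
 * mapping may cover only part of the current entry (e.g. when the
 * IOMMU clamps the length), in which case the walk resumes
 * mid-entry.  Simplified and self-contained; names are invented. */
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;
struct sg { dma_addr_t base, len; };

struct walk {
    const struct sg *list;
    int nsg;
    int cur_index;
    dma_addr_t cur_byte;
};

/* Advance by 'mapped' bytes; returns the guest address the next
 * mapping should start at, or 0 when the list is exhausted.
 * Assumes 'mapped' never exceeds the remainder of the entry. */
static dma_addr_t walk_advance(struct walk *w, dma_addr_t mapped)
{
    w->cur_byte += mapped;
    if (w->cur_byte == w->list[w->cur_index].len) {
        w->cur_byte = 0;
        w->cur_index++;
    }
    if (w->cur_index == w->nsg)
        return 0;
    return w->list[w->cur_index].base + w->cur_byte;
}
```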
Re: [PATCH 01/13] Generic DMA memory access interface
On Fri, Feb 04, 2011 at 01:32:55AM +0200, Eduard - Gabriel Munteanu wrote: This introduces replacements for memory access functions like cpu_physical_memory_read(). The new interface can handle address translation and access checking through an IOMMU. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro --- Makefile.target |2 +- hw/dma_rw.c | 124 +++ hw/dma_rw.h | 157 +++ 3 files changed, 282 insertions(+), 1 deletions(-) create mode 100644 hw/dma_rw.c create mode 100644 hw/dma_rw.h diff --git a/Makefile.target b/Makefile.target index e15b1c4..e5817ab 100644 --- a/Makefile.target +++ b/Makefile.target @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += debugcon.o multiboot.o -obj-i386-y += pc_piix.o +obj-i386-y += pc_piix.o dma_rw.o Does this need to be target specific? obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o # shared objects diff --git a/hw/dma_rw.c b/hw/dma_rw.c new file mode 100644 index 000..ef8e7f8 --- /dev/null +++ b/hw/dma_rw.c @@ -0,0 +1,124 @@ +/* + * Generic DMA memory access interface. + * + * Copyright (c) 2011 Eduard - Gabriel Munteanu + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include "dma_rw.h" +#include "range.h" + +static void dma_register_memory_map(DMADevice *dev, +dma_addr_t addr, +dma_addr_t len, +target_phys_addr_t paddr, +DMAInvalidateMapFunc *invalidate, +void *invalidate_opaque) +{ +DMAMemoryMap *map; + +map = qemu_malloc(sizeof(DMAMemoryMap)); +map->addr = addr; +map->len = len; +map->paddr = paddr; +map->invalidate = invalidate; +map->invalidate_opaque = invalidate_opaque; + +QLIST_INSERT_HEAD(&dev->mmu->memory_maps, map, list); +} + +static void dma_unregister_memory_map(DMADevice *dev, + target_phys_addr_t paddr, + dma_addr_t len) +{ +DMAMemoryMap *map; + +QLIST_FOREACH(map, &dev->mmu->memory_maps, list) { +if (map->paddr == paddr && map->len == len) { +QLIST_REMOVE(map, list); +free(map); +} +} +} + +void dma_invalidate_memory_range(DMADevice *dev, + dma_addr_t addr, + dma_addr_t len) +{ +DMAMemoryMap *map; + +QLIST_FOREACH(map, &dev->mmu->memory_maps, list) { +if (ranges_overlap(addr, len, map->addr, map->len)) { +map->invalidate(map->invalidate_opaque); +QLIST_REMOVE(map, list); +free(map); +} +} +} + +void *dma_memory_map(DMADevice *dev, + DMAInvalidateMapFunc *cb, + void *opaque, + dma_addr_t addr, + dma_addr_t *len, + int is_write) +{ +int err; +target_phys_addr_t paddr, plen; + +if (!dev || !dev->mmu) { +return cpu_physical_memory_map(addr, len, is_write); +} + +plen = *len; +err = dev->mmu->translate(dev, addr, &paddr, &plen, is_write); +if (err) { +return NULL; +} + +/* + * If this is true, the virtual region is contiguous, + * but the translated physical region isn't. We just + * clamp *len, much like cpu_physical_memory_map() does. + */ +if (plen < *len) { +*len = plen; +} + +/* We treat maps as remote TLBs to cope
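The invalidation bookkeeping in the quoted dma_rw.c can be modelled with a fixed array instead of a QLIST. Everything except the overlap predicate — which matches the intent of QEMU's ranges_overlap() — is invented for this sketch: a registered map whose range overlaps an invalidated range has its callback "fired" (a flag here) and is dropped, so a second invalidation of the same range finds nothing.

```c
/* Self-contained model of the map-invalidation bookkeeping in the
 * quoted dma_rw.c: registered maps are kept in a collection, and
 * invalidate_range() fires the callback for every map overlapping
 * the invalidated range and removes it.  Array instead of QLIST;
 * all names here are invented for the sketch. */
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

struct map { dma_addr_t addr, len; int live; int invalidated; };

/* Half-open overlap test for nonzero lengths. */
static int ranges_overlap(dma_addr_t a, dma_addr_t alen,
                          dma_addr_t b, dma_addr_t blen)
{
    return a < b + blen && b < a + alen;
}

static int invalidate_range(struct map *maps, int n,
                            dma_addr_t addr, dma_addr_t len)
{
    int hit = 0;
    for (int i = 0; i < n; i++) {
        if (maps[i].live &&
            ranges_overlap(addr, len, maps[i].addr, maps[i].len)) {
            maps[i].invalidated = 1;   /* stands in for map->invalidate() */
            maps[i].live = 0;          /* QLIST_REMOVE + free */
            hit++;
        }
    }
    return hit;
}
```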
Re: [PATCH 2/3] AMD IOMMU support
On Fri, Feb 04, 2011 at 01:24:14AM +0200, Eduard - Gabriel Munteanu wrote: This initializes the AMD IOMMU and creates ACPI tables for it. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro --- src/acpi.c | 84 src/config.h |3 ++ src/pci_ids.h |1 + src/pci_regs.h |1 + src/pciinit.c | 29 +++ 5 files changed, 118 insertions(+), 0 deletions(-) diff --git a/src/acpi.c b/src/acpi.c index 18830dc..fca152c 100644 --- a/src/acpi.c +++ b/src/acpi.c @@ -196,6 +196,36 @@ struct srat_memory_affinity u32reserved3[2]; } PACKED; +/* + * IVRS (I/O Virtualization Reporting Structure) table. + * + * Describes the AMD IOMMU, as per: + * AMD I/O Virtualization Technology (IOMMU) Specification, rev 1.26 + */ + +struct ivrs_ivhd +{ +u8type; +u8flags; +u16 length; +u16 devid; +u16 capab_off; +u32 iommu_base_low; +u32 iommu_base_high; +u16 pci_seg_group; +u16 iommu_info; +u32 reserved; +u8entry[0]; +} PACKED; + +struct ivrs_table +{ +ACPI_TABLE_HEADER_DEF/* ACPI common table header. */ +u32iv_info; +u32reserved[2]; +struct ivrs_ivhd ivhd; +} PACKED; + prefix with amd_iommu_ or amd_ then ? #include acpi-dsdt.hex static void @@ -579,6 +609,59 @@ build_srat(void) return srat; } +#define IVRS_SIGNATURE 0x53525649 // IVRS +#define IVRS_MAX_DEVS 32 +static void * +build_ivrs(void) +{ +int iommu_bdf, iommu_cap; +int bdf, max, i; +struct ivrs_table *ivrs; +struct ivrs_ivhd *ivhd; + +/* Note this currently works for a single IOMMU! */ Meant to be a FIXME? How hard is it to fix? Just stick this in a loop? 
+iommu_bdf = pci_find_class(PCI_CLASS_SYSTEM_IOMMU); +if (iommu_bdf < 0) +return NULL; +iommu_cap = pci_find_capability(iommu_bdf, PCI_CAP_ID_SEC); +if (iommu_cap < 0) +return NULL; + +ivrs = malloc_high(sizeof(struct ivrs_table) + 4 * IVRS_MAX_DEVS); +ivrs->iv_info = pci_config_readw(iommu_bdf, iommu_cap + 0x12) & ~0x000F; + +ivhd = &ivrs->ivhd; +ivhd->type = 0x10; +ivhd->flags = 0; +ivhd->length = sizeof(struct ivrs_ivhd); +ivhd->devid = iommu_bdf; +ivhd->capab_off = iommu_cap; +ivhd->iommu_base_low = pci_config_readl(iommu_bdf, iommu_cap + 0x04) + 0xFFFE; +ivhd->iommu_base_high = pci_config_readl(iommu_bdf, iommu_cap + 0x08); +ivhd->pci_seg_group = 0; +ivhd->iommu_info = 0; +ivhd->reserved = 0; + +i = 0; +foreachpci(bdf, max) { +if (bdf == ivhd->devid) +continue; +ivhd->entry[4 * i + 0] = 2; +ivhd->entry[4 * i + 1] = bdf & 0xFF; +ivhd->entry[4 * i + 2] = (bdf >> 8) & 0xFF; +ivhd->entry[4 * i + 3] = ~(1 << 3); +ivhd->length += 4; +if (++i >= IVRS_MAX_DEVS) +break; +} + +build_header((void *) ivrs, IVRS_SIGNATURE, + sizeof(struct ivrs_table) + 4 * i, 1); + +return ivrs; +} + static const struct pci_device_id acpi_find_tbl[] = { /* PIIX4 Power Management device.
*/ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82371AB_3, NULL), @@ -625,6 +708,7 @@ acpi_bios_init(void) ACPI_INIT_TABLE(build_madt()); ACPI_INIT_TABLE(build_hpet()); ACPI_INIT_TABLE(build_srat()); +ACPI_INIT_TABLE(build_ivrs()); u16 i, external_tables = qemu_cfg_acpi_additional_tables(); diff --git a/src/config.h b/src/config.h index 6356941..0ba5723 100644 --- a/src/config.h +++ b/src/config.h @@ -172,6 +172,9 @@ #define BUILD_APIC_ADDR 0xfee0 #define BUILD_IOAPIC_ADDR 0xfec0 +#define BUILD_AMD_IOMMU_START 0xfed0 +#define BUILD_AMD_IOMMU_END 0xfee0/* BUILD_APIC_ADDR */ + #define BUILD_SMM_INIT_ADDR 0x38000 #define BUILD_SMM_ADDR0xa8000 #define BUILD_SMM_SIZE0x8000 diff --git a/src/pci_ids.h b/src/pci_ids.h index e1cded2..3cc3c6e 100644 --- a/src/pci_ids.h +++ b/src/pci_ids.h @@ -72,6 +72,7 @@ #define PCI_CLASS_SYSTEM_RTC 0x0803 #define PCI_CLASS_SYSTEM_PCI_HOTPLUG 0x0804 #define PCI_CLASS_SYSTEM_SDHCI 0x0805 +#define PCI_CLASS_SYSTEM_IOMMU 0x0806 #define PCI_CLASS_SYSTEM_OTHER 0x0880 #define PCI_BASE_CLASS_INPUT 0x09 diff --git a/src/pci_regs.h b/src/pci_regs.h index e5effd4..bfac824 100644 --- a/src/pci_regs.h +++ b/src/pci_regs.h @@ -208,6 +208,7 @@ #define PCI_CAP_ID_SHPC 0x0C/* PCI Standard Hot-Plug Controller */ #define PCI_CAP_ID_SSVID0x0D/* Bridge subsystem vendor/device ID */ #define PCI_CAP_ID_AGP3 0x0E/* AGP Target
Re: [PATCH 2/3] AMD IOMMU support
On Sun, Feb 06, 2011 at 03:41:45PM +0200, Eduard - Gabriel Munteanu wrote: On Sun, Feb 06, 2011 at 01:47:57PM +0200, Michael S. Tsirkin wrote: On Fri, Feb 04, 2011 at 01:24:14AM +0200, Eduard - Gabriel Munteanu wrote: Hi, [snip] +/* + * IVRS (I/O Virtualization Reporting Structure) table. + * + * Describes the AMD IOMMU, as per: + * AMD I/O Virtualization Technology (IOMMU) Specification, rev 1.26 + */ + +struct ivrs_ivhd +{ +u8type; +u8flags; +u16 length; +u16 devid; +u16 capab_off; +u32 iommu_base_low; +u32 iommu_base_high; +u16 pci_seg_group; +u16 iommu_info; +u32 reserved; +u8entry[0]; +} PACKED; + +struct ivrs_table +{ +ACPI_TABLE_HEADER_DEF/* ACPI common table header. */ +u32iv_info; +u32reserved[2]; +struct ivrs_ivhd ivhd; +} PACKED; + prefix with amd_iommu_ or amd_ then ? This should be standard nomenclature already, even if IVRS is AMD IOMMU-specific. Yes but the specific structure is amd specific, isn't it? #include acpi-dsdt.hex static void @@ -579,6 +609,59 @@ build_srat(void) return srat; } +#define IVRS_SIGNATURE 0x53525649 // IVRS +#define IVRS_MAX_DEVS 32 +static void * +build_ivrs(void) +{ +int iommu_bdf, iommu_cap; +int bdf, max, i; +struct ivrs_table *ivrs; +struct ivrs_ivhd *ivhd; + +/* Note this currently works for a single IOMMU! */ Meant to be a FIXME? How hard is it to fix? Just stick this in a loop? I suspect a real BIOS would have these values hardcoded anyway, according to the topology of the PCI bus and which IOMMUs sit where. Which values exactly? You already mentioned the possibility of multiple IOMMU capabilities in the same function/bus, in which case there's probably no easy way to guess it from SeaBIOS. It's easy enough to enumerate capabilities and pci devices, isn't it? [snip] +static void amd_iommu_init(u16 bdf, void *arg) +{ +int cap; +u32 base_addr; + +cap = pci_find_capability(bdf, PCI_CAP_ID_SEC); +if (cap 0) { +return; +} There actually can be multiple instances of this capability according to spec. Do we care? 
Hm, perhaps we should at least assign a base address there, that's easy. As for QEMU/KVM usage we probably don't need it. I expect assigning multiple domains will be useful. I'm guessing multiple devices is what systems have in this case? If so I'm not really sure why there is a need for multiple iommu capabilities per device. + +if (amd_iommu_addr >= BUILD_AMD_IOMMU_END) { +return; +} +base_addr = amd_iommu_addr; +amd_iommu_addr += 0x4000; + +pci_config_writel(bdf, cap + 0x0C, 0); +pci_config_writel(bdf, cap + 0x08, 0); +pci_config_writel(bdf, cap + 0x04, base_addr | 1); +} + static const struct pci_device_id pci_class_tbl[] = { /* STORAGE IDE */ PCI_DEVICE_CLASS(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82371SB_1, @@ -279,6 +302,10 @@ static const struct pci_device_id pci_class_tbl[] = { PCI_DEVICE_CLASS(PCI_ANY_ID, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, pci_bios_init_device_bridge), +/* AMD IOMMU */ Makes sense to limit to AMD vendor id? I don't think so; I assume any PCI_CLASS_SYSTEM_IOMMU device would implement the same specification, considering these ids have been assigned by PCI-SIG. This hasn't been the case in the past, e.g. with PCI_CLASS_NETWORK_ETHERNET, so I see no reason to assume it here. +PCI_DEVICE_CLASS(PCI_ANY_ID, PCI_ANY_ID, PCI_CLASS_SYSTEM_IOMMU, + amd_iommu_init), + /* default */ PCI_DEVICE(PCI_ANY_ID, PCI_ANY_ID, pci_bios_allocate_regions), @@ -408,6 +435,8 @@ pci_setup(void) pci_region_init(&pci_bios_prefmem_region, BUILD_PCIPREFMEM_START, BUILD_PCIPREFMEM_END - 1); +amd_iommu_addr = BUILD_AMD_IOMMU_START; + pci_bios_init_bus(); int bdf, max; -- 1.7.3.4 Thanks for your review, I read your other comments and will resubmit once I fix those issues. Eduard
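The per-device loop in the quoted build_ivrs() emits 4-byte IVHD device entries. The helper below is a sketch of just that encoding; the function name is invented, and the byte values follow the quoted loop (type 2 "select" entry, BDF split across two bytes, data byte with bit 3 clear), so treat it as illustrative rather than a complete IVRS encoder.

```c
/* Sketch of the 4-byte IVHD "select" device entry the quoted
 * build_ivrs() emits for each PCI device.  Layout follows the
 * quoted loop; the helper itself is invented for illustration. */
#include <assert.h>
#include <stdint.h>

static void ivhd_select_entry(uint8_t entry[4], uint16_t bdf)
{
    entry[0] = 2;                    /* "select" entry type */
    entry[1] = bdf & 0xFF;           /* BDF low byte */
    entry[2] = (bdf >> 8) & 0xFF;    /* BDF high byte */
    entry[3] = (uint8_t)~(1 << 3);   /* data byte with bit 3 clear, as quoted */
}
```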
Re: Network performance with small packets
On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote: On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Not sure whether the intent is to merge this. If yes - would it make sense to use tracing for this instead? That's what kvm does. 
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -7,6 +7,14 @@ config VIRTIO_RING tristate depends on VIRTIO +config VIRTIO_STATS + bool "Virtio debugging stats (EXPERIMENTAL)" + depends on VIRTIO_RING + select DEBUG_FS + ---help--- + Virtio stats collected by how full the ring is at any time, + presented under debugfs/virtio/name-vq/num-used/ + config VIRTIO_PCI tristate "PCI driver for virtio devices (EXPERIMENTAL)" depends on PCI && EXPERIMENTAL diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -21,6 +21,7 @@ #include <linux/virtio_config.h> #include <linux/device.h> #include <linux/slab.h> +#include <linux/debugfs.h> /* virtio guest is communicating with a virtual device that actually runs on * a host processor. Memory barriers are used to control SMP effects. */ @@ -95,6 +96,11 @@ struct vring_virtqueue /* How to notify other side. FIXME: commonalize hcalls! */ void (*notify)(struct virtqueue *vq); +#ifdef CONFIG_VIRTIO_STATS + struct vring_stat *stats; + struct dentry *statdir; +#endif + #ifdef DEBUG /* They're supposed to lock for us. */ unsigned int in_use; @@ -106,6 +112,87 @@ struct vring_virtqueue #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq) +#ifdef CONFIG_VIRTIO_STATS +/* We have an array of these, indexed by how full the ring is. */ +struct vring_stat { + /* How many interrupts? */ + size_t interrupt_nowork, interrupt_work; + /* How many non-notify kicks, how many notify kicks, how many add notify? */ + size_t kick_no_notify, kick_notify, add_notify; + /* How many adds? */ + size_t add_direct, add_indirect, add_fail; + /* How many gets? */ + size_t get; + /* How many disable callbacks? */ + size_t disable_cb; + /* How many enables? */ + size_t enable_cb_retry, enable_cb_success; +}; + +static struct dentry *virtio_stats; + +static void create_stat_files(struct vring_virtqueue *vq) +{ + char name[80]; + unsigned int i; + + /* Racy in theory, but we don't care. */ + if (!virtio_stats) + virtio_stats = debugfs_create_dir("virtio-stats", NULL); + + sprintf(name, "%s-%s", dev_name(&vq->vq.vdev->dev), vq->vq.name); + vq->statdir = debugfs_create_dir(name, virtio_stats); + + for (i = 0; i < vq->vring.num; i++) { + struct dentry *dir; + + sprintf(name, "%i", i); + dir = debugfs_create_dir(name, vq->statdir); + debugfs_create_size_t("interrupt_nowork", 0400, dir, + &vq->stats[i].interrupt_nowork); + debugfs_create_size_t("interrupt_work", 0400, dir, + &vq->stats[i].interrupt_work); + debugfs_create_size_t("kick_no_notify", 0400, dir, + &vq->stats[i].kick_no_notify); + debugfs_create_size_t("kick_notify", 0400, dir, + &vq->stats[i].kick_notify); + debugfs_create_size_t("add_notify", 0400, dir, + &vq->stats[i
Re: Network performance with small packets
On Wed, Feb 09, 2011 at 12:09:35PM +1030, Rusty Russell wrote: On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote: On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote: On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Not sure whether the intent is to merge this. If yes - would it make sense to use tracing for this instead? That's what kvm does. Intent wasn't; I've not used tracepoints before, but maybe we should consider a longer-term monitoring solution? Patch welcome! Cheers, Rusty. Sure, I'll look into this. 
-- MST
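The core idea of Rusty's stats patch — an array of counter structs indexed by ring occupancy — can be shown in a few lines. This sketch is not the patch: the ring struct and helpers are invented, and only three of the vring_stat counters are kept; it just illustrates the index-by-fullness bookkeeping that makes the resulting histogram useful for diagnosing the kind of stall discussed in this thread.

```c
/* Minimal version of the idea in the quoted patch: keep one counter
 * struct per possible ring occupancy, and bump the slot matching how
 * full the ring is when the event happens.  Counter names are a
 * subset of the patch's vring_stat fields; the ring bookkeeping is
 * invented for the sketch. */
#include <assert.h>
#include <stddef.h>

struct vring_stat {
    size_t add_direct, add_indirect, get;
};

struct ring {
    unsigned num;             /* ring size */
    unsigned in_use;          /* descriptors currently used */
    struct vring_stat *stats; /* array of num + 1 entries */
};

static void stat_add(struct ring *r, int indirect)
{
    struct vring_stat *s = &r->stats[r->in_use];  /* index by fullness */
    if (indirect)
        s->add_indirect++;
    else
        s->add_direct++;
    r->in_use++;
}

static void stat_get(struct ring *r)
{
    r->stats[r->in_use].get++;
    r->in_use--;
}
```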
Re: [RFC PATCH V2 0/5] macvtap TX zero copy between guest and host kernel
On Fri, Dec 10, 2010 at 01:51:31AM -0800, Shirley Ma wrote: This patchset adds support for TX zero-copy between guest and host kernel through vhost. It significantly reduces CPU utilization on the local host on which the guest is located (it reduced vhost thread CPU usage by 30-50% for a single stream test). The patchset is based on the previous submission and comments from the community regarding when/how to handle guest kernel buffers to be released. This is the simplest approach I can think of after comparing with several other solutions. This patchset includes: 1. Introduce a new sock zero-copy flag, SOCK_ZEROCOPY; 2. Introduce a new device flag, NETIF_F_ZEROCOPY, for devices that can support zero-copy; 3. Add a new struct skb_ubuf_info in skb_share_info for a userspace buffer release callback, invoked when device DMA is done for that skb; 4. Add a vhost zero-copy callback in vhost for when the skb's last refcnt is gone; add vhost_zerocopy_add_used_and_signal to notify the guest to release TX skb buffers. 5. Add macvtap zero-copy in the lower device when the packet being sent is greater than 128 bytes. The patchset has passed netperf/netserver tests on Chelsio, and testing continues on other 10GbE NICs, like Intel ixgbe, Mellanox mlx4... I will provide guest-to-host and host-to-guest performance data next week. However, when running a stress test, vhost virtio_net seems to go out of sync: the virtio_net interrupt was disabled somehow, and it stopped sending any packets. This problem has bothered me for a long, long time; I will continue to look at it. Please review this. Thanks Shirley What's the status here? Since there are core net changes, we'll need to see the final version soon if it's to appear in 2.6.39. Could the problem be related to the patch virtio_net: Add schedule check to napi_enable call? Also, I expect there should be driver patches for some devices? Where are they?
Thanks, -- MST
Re: virtio_net sometimes didn't work
On Wed, Feb 16, 2011 at 09:53:25AM +0800, lidong chen wrote: Because of some other work, I could not focus on this problem last month. Now I find that a wrong value of isr causes this problem. In function vp_interrupt, the isr is 0, and the virtio_balloon pci device got 10 interrupts; then IRQ #11 was disabled.

static irqreturn_t vp_interrupt(int irq, void *opaque)
{
	struct virtio_pci_device *vp_dev = opaque;
	struct virtio_pci_vq_info *info;
	irqreturn_t ret = IRQ_NONE;
	unsigned long flags;
	u8 isr;

	/* reading the ISR has the effect of also clearing it so it's very
	 * important to save off the value. */
	isr = ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR);

	/* It's definitely not us if the ISR was not high */
	if (!isr)
		return IRQ_NONE;	// return from here

This implies that io addr values got swapped between the devices. Try lspci -vv in the guest and info pci in qemu and compare the io address values.

2010/12/9 Michael S. Tsirkin m...@redhat.com: On Fri, Nov 26, 2010 at 10:38:33AM +0800, lidong chen wrote: Does this message appear on boot, or after some stress? On boot, and it only appears when booting from network. Which qemu-kvm version? [root@kvm-4slot ~]# /usr/libexec/qemu-kvm --version QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2), Copyright (c) 2003-2008 Fabrice Bellard What happens with the latest qemu? Does cherry-picking 3fff0179e33cd7d0a688dab65700c46ad089e934 help? virtio_pci already includes this patch; we still have this problem. What does info irqs show in qemu? How do I collect this information? But I found that if I modify the slot number of the balloon device from 0x03 to 0x09, the problem is solved:

<memballoon model='virtio'>
  <alias name='balloon0'/>
  <address type='pci' domain='0x' bus='0x00' slot='0x09' function='0x0'/>
</memballoon>

Interesting. Is it possible that even after the balloon is moved, there's still the message in the guest, only this time things keep going afterwards? And I found someone else who also met this problem.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/584675
2010/11/25 Michael S. Tsirkin m...@redhat.com: On Thu, Nov 25, 2010 at 10:21:24PM +0800, lidong chen wrote: [version] the host os version is 2.6.32 Which qemu-kvm version? the guest os version is 2.6.16
[dmesg]
ACPI: (supports S3 S4 S5)
Freeing unused kernel memory: 200k freed
input: ImExPS/2 Generic Explorer Mouse as /class/input/input2
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
ACPI: PCI Interrupt :00:03.0[A] -> Link [LNKC] -> GSI 11 (level, high) -> IRQ 11 io address 0001c040
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 10
ACPI: PCI Interrupt :00:04.0[A] -> Link [LNKD] -> GSI 10 (level, high) -> IRQ 10 io address 0001c060
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
ACPI: PCI Interrupt :00:05.0[A] -> Link [LNKA] -> GSI 10 (level, high) -> IRQ 10 io address 0001c080
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 11
ACPI: PCI Interrupt :00:06.0[A] -> Link [LNKB] -> GSI 11 (level, high) -> IRQ 11 io address 0001c0a06
ACPI: PCI Interrupt :00:07.0[A] -> Link [LNKC] -> GSI 11 (level, high) -> IRQ 11 io address 0001c0c06
ACPI: PCI Interrupt :00:08.0[A] -> Link [LNKD] -> GSI 10 (level, high) -> IRQ 10 io address 0001c0e0
irq 11: nobody cared (try booting with the "irqpoll" option)
 [c01457b0] __report_bad_irq+0x2b/0x69
 [c0145979] note_interrupt+0x18b/0x1b2
 [c01452a9] handle_IRQ_event+0x26/0x51
 [c014537f] __do_IRQ+0xab/0xdc
 [c0106445] do_IRQ+0x46/0x53
 [c0104e8a] common_interrupt+0x1a/0x20
 [c01276f2] __do_softirq+0x4f/0xc2
 [c0127793] do_softirq+0x2e/0x32
 [c0104f3c] apic_timer_interrupt+0x1c/0x30
 [c0102d55] default_idle+0x2e/0x5c
 [c0102e14] cpu_idle+0x91/0xad
 [c03946e5] start_kernel+0x34c/0x353
handlers:
 [f88252ee] (vp_interrupt+0x0/0x3e [virtio_pci])
Disabling IRQ #11
Does this message appear on boot, or after some stress? Does cherry-picking 3fff0179e33cd7d0a688dab65700c46ad089e934 help? Happens with a newer kernel as guest? What does info irqs show in qemu?
-- MST
Re: virtio_net sometimes didn't work
- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Region 0: Memory at f000 (32-bit, prefetchable) [size=32M] Region 1: Memory at f200 (32-bit, non-prefetchable) [size=4K] Expansion ROM at f201 [disabled] [size=64K] 00:03.0 RAM memory: Unknown device 1af4:1002 Subsystem: Unknown device 1af4:0005 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Interrupt: pin A routed to IRQ 11 Region 0: I/O ports at c040 [size=32] 00:04.0 Ethernet controller: Unknown device 1af4:1000 Subsystem: Unknown device 1af4:0001 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Latency: 0 Interrupt: pin A routed to IRQ 10 Region 0: I/O ports at c060 [size=32] Region 1: Memory at f202 (32-bit, non-prefetchable) [size=4K] Expansion ROM at f203 [disabled] [size=64K] Capabilities: [40] MSI-X: Enable- Mask- TabSize=3 Vector table: BAR=1 offset= PBA: BAR=1 offset=0800 00:05.0 Ethernet controller: Unknown device 1af4:1000 Subsystem: Unknown device 1af4:0001 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Interrupt: pin A routed to IRQ 10 Region 0: I/O ports at c080 [size=32] Region 1: Memory at f204 (32-bit, non-prefetchable) [size=4K] Expansion ROM at f205 [disabled] [size=64K] Capabilities: [40] MSI-X: Enable- Mask- TabSize=3 Vector table: BAR=1 offset= PBA: BAR=1 offset=0800 00:06.0 Ethernet controller: Unknown device 1af4:1000 Subsystem: Unknown device 1af4:0001 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Interrupt: pin A routed to IRQ 11 
Region 0: I/O ports at c0a0 [size=32] Region 1: Memory at f206 (32-bit, non-prefetchable) [size=4K] Expansion ROM at f207 [disabled] [size=64K] Capabilities: [40] MSI-X: Enable- Mask- TabSize=3 Vector table: BAR=1 offset= PBA: BAR=1 offset=0800 00:07.0 Ethernet controller: Unknown device 1af4:1000 Subsystem: Unknown device 1af4:0001 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Interrupt: pin A routed to IRQ 11 Region 0: I/O ports at c0c0 [size=32] Region 1: Memory at f208 (32-bit, non-prefetchable) [size=4K] Expansion ROM at f209 [disabled] [size=64K] Capabilities: [40] MSI-X: Enable- Mask- TabSize=3 Vector table: BAR=1 offset= PBA: BAR=1 offset=0800 00:08.0 Ethernet controller: Unknown device 1af4:1000 Subsystem: Unknown device 1af4:0001 Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Interrupt: pin A routed to IRQ 10 Region 0: I/O ports at c0e0 [size=32] Region 1: Memory at f20a (32-bit, non-prefetchable) [size=4K] Expansion ROM at f20b [disabled] [size=64K] Capabilities: [40] MSI-X: Enable- Mask- TabSize=3 Vector table: BAR=1 offset= PBA: BAR=1 offset=0800 2011/2/16 Michael S. Tsirkin m...@redhat.com: On Wed, Feb 16, 2011 at 09:53:25AM +0800, lidong chen wrote: because of some other work, i could not focus on this problem last month. now i find the wrong vaule of isr cause this problem. in function vp_interrupt, the isr is 0, and the virtio_balloon pci device have 10 times initerrupts. then IRQ #11 disabled. static irqreturn_t vp_interrupt(int irq, void *opaque) { struct virtio_pci_device *vp_dev = opaque; struct virtio_pci_vq_info *info; irqreturn_t ret = IRQ_NONE; unsigned long flags; u8 isr; /* reading the ISR has the effect of also clearing it so it's very * important to save off the value. 
 */
	isr = ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR);

	/* It's definitely not us if the ISR was not high */
	if (!isr)
		return IRQ_NONE;	// return from here

This implies that io addr values got swapped between the devices. Try lspci -vv in guest
KVM call minutes for Feb 22
Looks like Chris will send minutes too, so I didn't do much to polish this. I didn't realise he was doing it until I had this, so here's the braindump; hope it helps.
1. 0.14 postmortem
   - what went well: wiki for planning, testing
   - what can be improved: rc cycle could be longer
2. 0.15
   - should be end of June; July 1?
   - features: QED; other? virtagent? might be blocked by the rpc issue (discussion followed)
3. virtagent rpc
   - what should virtagent use to talk to the world outside of QEMU? Generalize QMP? Use an external rpc library?
   - do we want to/can we have virtagent out of process? Hard to do well with live migration. Would be the most beneficial if we can make it stateless. Might be good for security but hard to integrate.
4. qcow3
   - there will be a wiki with features we want there
5. google summer of code
   - we only have 3 projects; need more projects/mentors; we have another week
6. switch from gpxe to ipxe
   - needed to get patches applied; we'll likely switch; need some testing; might be possible by 0.15
-- MST
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Wed, Feb 23, 2011 at 10:52:09AM +0530, Krishna Kumar2 wrote: Simon Horman ho...@verge.net.au wrote on 02/22/2011 01:17:09 PM: Hi Simon, I have a few questions about the results below: 1. Are the (%) comparisons between non-mq and mq virtio? Yes - mainline kernel with the transmit-only MQ patch. 2. Was UDP or TCP used? TCP. I had done some initial testing on UDP, but don't have the results now as they are really old. But I will be running it again. 3. What was the transmit size (-m option to netperf)? I didn't use the -m option, so it defaults to 16K. The script does: netperf -t TCP_STREAM -c -C -l 60 -H $SERVER Also, I'm interested to know what the status of these patches is. Are you planning a fresh series? Yes. Michael Tsirkin had wanted to see how the MQ RX patch would look, so I was in the process of getting the two working together. The patch is ready and is being tested. Should I send an RFC patch at this time? Yes, please do. The TX-only patch helped the guest TX path but didn't help host-to-guest much (as tested using TCP_MAERTS from the guest). But with the TX+RX patch, both directions are getting improvements. Also, my hope is that with appropriate queue mapping, we might be able to do away with the heuristics to detect single stream load that the TX-only code needs. Remote testing is still to be done. Others might be able to help here once you post the patch. Thanks, - KK

Changes from rev2:
1. Define (in virtio_net.h) the maximum send txqs, and use it in virtio-net and vhost-net.
2. vi->sq[i] is allocated individually, resulting in cache line aligned sq[0] to sq[n]. Another option was to define 'send_queue' as:
   struct send_queue {
       struct virtqueue *svq;
       struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
   } ____cacheline_aligned_in_smp;
   and to statically allocate 'VIRTIO_MAX_SQ' of those. I hope the submitted method is preferable.
3. Changed the vhost model such that vhost[0] handles RX and vhost[1-MAX] handles TX[0-n].
4.
Further change TX handling such that vhost[0] handles both RX/TX for the single stream case.

Enabling MQ on virtio: when the following options are passed to qemu:
- smp > 1
- vhost=on
- mq=on (new option, default: off)
then #txqueues = #cpus. The #txqueues can be changed by using an optional 'numtxqs' option. e.g. for a smp=4 guest:
vhost=on -> #txqueues = 1
vhost=on,mq=on -> #txqueues = 4
vhost=on,mq=on,numtxqs=2 -> #txqueues = 2
vhost=on,mq=on,numtxqs=8 -> #txqueues = 8

Performance (guest -> local host):
System configuration: Host: 8 Intel Xeon, 8 GB memory. Guest: 4 cpus, 2 GB memory. Test: Each test case runs for 60 secs, sum over three runs (except when the number of netperf sessions is 1, which has 10 runs of 12 secs each). No tuning (default netperf) other than taskset'ing vhosts to cpus 0-3. numtxqs=32 gave the best results though the guest had only 4 vcpus (I haven't tried beyond that).

numtxqs=2, vhosts=3
#sessions  BW%     CPU%    RCPU%   SD%      RSD%
1          4.46    -1.96   .19     -12.50   -6.06
2          4.93    -1.16   2.10    0        -2.38
4          46.17   64.77   33.72   19.51    -2.48
8          47.89   70.00   36.23   41.46    13.35
16         48.97   80.44   40.67   21.11    -5.46
24         49.03   78.78   41.22   20.51    -4.78
32         51.11   77.15   42.42   15.81    -6.87
40         51.60   71.65   42.43   9.75     -8.94
48         50.10   69.55   42.85   11.80    -5.81
64         46.24   68.42   42.67   14.18    -3.28
80         46.37   63.13   41.62   7.43     -6.73
96         46.40   63.31   42.20   9.36     -4.78
128        50.43   62.79   42.16   13.11    -1.23
BW: 37.2%, CPU/RCPU: 66.3%,41.6%, SD/RSD: 11.5%,-3.7%

numtxqs=8, vhosts=5
#sessions  BW%     CPU%    RCPU%   SD%      RSD%
1          -.76    -1.56   2.33    0        3.03
2          17.41   11.11   11.41   0        -4.76
4          42.12   55.11   30.20   19.51    .62
8          54.69   80.00   39.22   24.39    -3.88
16         54.77   81.62   40.89   20.34    -6.58
24         54.66   79.68   41.57   15.49    -8.99
32         54.92   76.82   41.79   17.59    -5.70
40
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Wed, Feb 23, 2011 at 02:59:04PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 08:19:21 Alex Williamson wrote: On Sun, 2011-01-30 at 13:11 +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Looks pretty good overall. A few comments below. It seems like we should be able to hook this into vfio with a small stub in kvm. We just need to be able to communicate disabling and enabling of individual msix vectors. For brute force, we could do this with a lot of eventfds, triggered by kvm and consumed by vfio, two per MSI-X vector. Not sure if there's something smaller that could do it. Thanks, Alex, thanks for your comments. See my comments below: Alex Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/Makefile|2 +- arch/x86/kvm/x86.c |8 +- include/linux/kvm.h | 21 include/linux/kvm_host.h | 25 virt/kvm/assigned-dev.c | 44 +++ virt/kvm/kvm_main.c | 38 ++- virt/kvm/msix_mmio.c | 286 ++ virt/kvm/msix_mmio.h | 25 8 files changed, 442 insertions(+), 7 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
	coalesced_mmio.o irq_comm.o eventfd.o \
-	assigned-dev.o)
+	assigned-dev.o msix_mmio.o)
kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o)
kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fa708c9..89bf12c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
 	case KVM_CAP_ASYNC_PF:
+	case KVM_CAP_MSIX_MMIO:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa;
+	int r;
 	gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception);
@@ -3822,14 +3824,16 @@ static int emulator_write_emulated_onepage(unsigned long addr,
 mmio:
 	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val);
+	r = vcpu_mmio_write(vcpu, gpa, bytes, val);
 	/*
 	 * Is this MMIO handled locally?
 	 */
-	if (!vcpu_mmio_write(vcpu, gpa, bytes, val))
+	if (!r)
 		return X86EMUL_CONTINUE;
 	vcpu->mmio_needed = 1;
-	vcpu->run->exit_reason = KVM_EXIT_MMIO;
+	vcpu->run->exit_reason = (r == -ENOTSYNC) ?
+		KVM_EXIT_MSIX_ROUTING_UPDATE : KVM_EXIT_MMIO;

This use of -ENOTSYNC is IMO confusing. How about we make vcpu_mmio_write return the positive exit reason? Negative value will mean an error.
 	vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa;
 	vcpu->run->mmio.len = vcpu->mmio_size = bytes;
 	vcpu->run->mmio.is_write = vcpu->mmio_is_write = 1;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..ad9df4b 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -161,6 +161,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_NMI 16
 #define KVM_EXIT_INTERNAL_ERROR 17
 #define KVM_EXIT_OSI 18
+#define KVM_EXIT_MSIX_ROUTING_UPDATE 19
 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
@@ -541,6 +542,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_MSIX_MMIO 60
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -672,6 +674,9 @@ struct kvm_clock_data {
 #define KVM_XEN_HVM_CONFIG _IOW(KVMIO, 0x7a, struct kvm_xen_hvm_config)
 #define KVM_SET_CLOCK _IOW(KVMIO, 0x7b, struct kvm_clock_data)
 #define KVM_GET_CLOCK _IOR(KVMIO, 0x7c, struct kvm_clock_data)
+/* Available with KVM_CAP_MSIX_MMIO */
+#define KVM_REGISTER_MSIX_MMIO _IOW(KVMIO, 0x7d, struct kvm_msix_mmio_user)
+#define KVM_UNREGISTER_MSIX_MMIO _IOW(KVMIO, 0x7e, struct kvm_msix_mmio_user)
 /* Available with KVM_CAP_PIT_STATE2 */
 #define KVM_GET_PIT2 _IOR(KVMIO, 0x9f, struct kvm_pit_state2)
 #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Wed, Feb 23, 2011 at 12:18:36PM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 12:09:15 PM: Hi Michael, Yes. Michael Tsirkin had wanted to see how the MQ RX patch would look like, so I was in the process of getting the two working together. The patch is ready and is being tested. Should I send a RFC patch at this time? Yes, please do. Sure, will get a build/test on latest bits and send in 1-2 days. The TX-only patch helped the guest TX path but didn't help host-guest much (as tested using TCP_MAERTS from the guest). But with the TX+RX patch, both directions are getting improvements. Also, my hope is that with appropriate queue mapping, we might be able to do away with heuristics to detect single stream load that TX only code needs. Yes, that whole stuff is removed, and the TX/RX path is unchanged with this patch (thankfully :) Cool. I was wondering whether in that case, we can do without host kernel changes at all, and use a separate fd for each TX/RX pair. The advantage of that approach is that this way, the max fd limit naturally sets an upper bound on the amount of resources userspace can use up. Thoughts? In any case, pls don't let the above delay sending an RFC. Remote testing is still to be done. Others might be able to help here once you post the patch. That's great, will appreciate any help. Thanks, - KK
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Wed, Feb 23, 2011 at 09:34:19AM -0700, Alex Williamson wrote: On Wed, 2011-02-23 at 14:59 +0800, Sheng Yang wrote: On Wednesday 23 February 2011 08:19:21 Alex Williamson wrote: On Sun, 2011-01-30 at 13:11 +0800, Sheng Yang wrote:

+static int msix_table_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
+				void *val)
+{
+	struct kvm_msix_mmio_dev *mmio_dev =
+		container_of(this, struct kvm_msix_mmio_dev, table_dev);
+	struct kvm_msix_mmio *mmio;
+	int idx, ret = 0, entry, offset, r;
+
+	mutex_lock(&mmio_dev->lock);
+	idx = get_mmio_table_index(mmio_dev, addr, len);
+	if (idx < 0) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+	if ((addr & 0x3) || (len != 4 && len != 8))
+		goto out;
+
+	offset = addr & 0xf;
+	if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL && len == 8)
+		goto out;
+
+	mmio = &mmio_dev->mmio[idx];
+	entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
+	r = copy_from_user(val, (void __user *)(mmio->table_base_va +
+			entry * PCI_MSIX_ENTRY_SIZE + offset), len);
+	if (r)
+		goto out;
+out:
+	mutex_unlock(&mmio_dev->lock);
+	return ret;
+}
+
+static int msix_table_mmio_write(struct kvm_io_device *this, gpa_t addr,
+				 int len, const void *val)
+{
+	struct kvm_msix_mmio_dev *mmio_dev =
+		container_of(this, struct kvm_msix_mmio_dev, table_dev);
+	struct kvm_msix_mmio *mmio;
+	int idx, entry, offset, ret = 0, r = 0;
+	gpa_t entry_base;
+	u32 old_ctrl, new_ctrl;
+	u32 *ctrl_pos;
+
+	mutex_lock(&mmio_dev->kvm->lock);
+	mutex_lock(&mmio_dev->lock);
+	idx = get_mmio_table_index(mmio_dev, addr, len);
+	if (idx < 0) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+	if ((addr & 0x3) || (len != 4 && len != 8))
+		goto out;
+
+	offset = addr & 0xF;

nit, this is 0xf above.

So you like another mask?

I'm just noting that above you used 0xf and here you used 0xF.
+	if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL && len == 8)
+		goto out;
+
+	mmio = &mmio_dev->mmio[idx];
+	entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
+	entry_base = mmio->table_base_va + entry * PCI_MSIX_ENTRY_SIZE;
+	ctrl_pos = (u32 *)(entry_base + PCI_MSIX_ENTRY_VECTOR_CTRL);
+
+	if (get_user(old_ctrl, ctrl_pos))
+		goto out;
+
+	/* Don't allow writing to other fields when entry is unmasked */
+	if (!(old_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT) &&
+	    offset != PCI_MSIX_ENTRY_VECTOR_CTRL)
+		goto out;
+
+	if (copy_to_user((void __user *)(entry_base + offset), val, len))
+		goto out;
+
+	if (get_user(new_ctrl, ctrl_pos))
+		goto out;
+
+	if ((offset < PCI_MSIX_ENTRY_VECTOR_CTRL && len == 4) ||
+	    (offset < PCI_MSIX_ENTRY_DATA && len == 8))
+		ret = -ENOTSYNC;

Couldn't we avoid the above get_user(new_ctrl,...) if we tested this first? I'd also prefer to see a direct goto out here, since it's not entirely obvious that this first 'if' always falls into the below 'if'. I'm not a fan of using an error code as a special non-error return either.

In fact when offset == PCI_MSIX_ENTRY_DATA and len == 8, we need to check new_ctrl to see if they indeed modified ctrl bits. I would try to make the logic here more clear.

Isn't that what you're testing for though? (offset < PCI_MSIX_ENTRY_VECTOR_CTRL && len == 4) If this is true, we can only be writing address, upper address, or data, not control. (offset < PCI_MSIX_ENTRY_DATA && len == 8) If this is true, we're writing address and upper address, not the control vector. Additionally, I think the size and alignment test should be more restrictive. We really only want to allow naturally aligned 4 or 8 byte accesses.
Maybe instead of:

	if ((addr & 0x3) || (len != 4 && len != 8))

we should do this:

	if (!(len == 4 || len == 8) || addr & (len - 1))

Then the test could become:

	if (offset + len <= PCI_MSIX_ENTRY_VECTOR_CTRL)

Thanks, Alex

Or rather:

	if (offset + len != PCI_MSIX_ENTRY_VECTOR_CTRL + 4)

We are matching offset + length to control offset + control length, and avoid depending on the fact that control is the last register.
-- MST
Re: [ISC-Bugs #22806] [PATCH] lfp: fix AF_INET checksum with csum offloading
No answer from isc so far. I think it's a problem that the popular dhclient has had this bug for so long. Does anyone know what else can be done? On Mon, Dec 27, 2010 at 07:16:32PM +0200, Michael S. Tsirkin wrote: When an AF_INET socket is used, linux would sometimes return a packet without the checksum. This happens when a packet originates on the same machine, which is most common with virtual machines but is possible even without them, with a software-based dhcp server such as dnsmasq. This appears to be a performance optimization to avoid calculating the checksum, and there's no way to disable it. Users are required to detect such packets by passing a msg_control parameter to the recvmsg call. This was added in linux kernel 2.6.21. dhclient from isc.org as of 4.2.0-P2 fails to do this and consequently discards such packets, so such setups fail to get an IP. This was reported in the past: https://lists.isc.org/mailman/htdig/dhcp-hackers/2010-April/001825.html The attached patch fixes this by passing msg_control and looking at the tp_status field in the result. If the TP_STATUS_CSUMNOTREADY bit is set, the packet checksum is missing because the packet is local. The patch below is currently successfully used at least on Fedora and some other Red Hat based distributions. Applying this to the ISC sources would help make the problem go away for everyone. --- Notes: - The patch is to be applied with -p1. - The patch below applies to dhcp 4.2.0-P2 and 4.1.2 (the latter with an offset). - Please Cc me on comments directly, I am not subscribed to the dhcp-hackers list.
common/bpf.c     |  2 +-
common/dlpi.c    |  2 +-
common/lpf.c     | 81 ++
common/nit.c     |  2 +-
common/packet.c  |  4 +-
common/upf.c     |  2 +-
includes/dhcpd.h |  2 +-
7 files changed, 70 insertions(+), 25 deletions(-)

diff --git a/common/bpf.c b/common/bpf.c
index b0ef657..8bd5727 100644
--- a/common/bpf.c
+++ b/common/bpf.c
@@ -485,7 +485,7 @@ ssize_t receive_packet (interface, buf, len, from, hfrom)
 	offset = decode_udp_ip_header (interface,
 				       interface->rbuf, interface->rbuf_offset,
-				       from, hdr.bh_caplen, paylen);
+				       from, hdr.bh_caplen, paylen, 0);
 
 	/* If the IP or UDP checksum was bad, skip the packet... */
 	if (offset < 0) {
diff --git a/common/dlpi.c b/common/dlpi.c
index eb64342..d4a8bb9 100644
--- a/common/dlpi.c
+++ b/common/dlpi.c
@@ -694,7 +694,7 @@ ssize_t receive_packet (interface, buf, len, from, hfrom)
 	length -= offset;
 #endif
 	offset = decode_udp_ip_header (interface, dbuf, bufix,
-				       from, length, paylen);
+				       from, length, paylen, 0);
 
 	/*
 	 * If the IP or UDP checksum was bad, skip the packet...
diff --git a/common/lpf.c b/common/lpf.c
index f727b7c..4bdb0f1 100644
--- a/common/lpf.c
+++ b/common/lpf.c
@@ -29,18 +29,33 @@
 #include "dhcpd.h"
 #if defined (USE_LPF_SEND) || defined (USE_LPF_RECEIVE)
 #include <sys/ioctl.h>
+#include <sys/socket.h>
 #include <sys/uio.h>
 #include <errno.h>
 #include <asm/types.h>
 #include <linux/filter.h>
 #include <linux/if_ether.h>
+#include <linux/if_packet.h>
 #include <netinet/in_systm.h>
 #include "includes/netinet/ip.h"
 #include "includes/netinet/udp.h"
 #include "includes/netinet/if_ether.h"
 #include <net/if.h>
 
+#ifndef PACKET_AUXDATA
+#define PACKET_AUXDATA 8
+
+struct tpacket_auxdata
+{
+	__u32 tp_status;
+	__u32 tp_len;
+	__u32 tp_snaplen;
+	__u16 tp_mac;
+	__u16 tp_net;
+};
+#endif
+
 /* Reinitializes the specified interface after an address change. This is not required for packet-filter APIs.
 */
@@ -66,10 +81,14 @@ int if_register_lpf (info)
 	struct interface_info *info;
 {
 	int sock;
-	struct sockaddr sa;
+	union {
+		struct sockaddr_ll ll;
+		struct sockaddr common;
+	} sa;
+	struct ifreq ifr;
 
 	/* Make an LPF socket. */
-	if ((sock = socket(PF_PACKET, SOCK_PACKET,
+	if ((sock = socket(PF_PACKET, SOCK_RAW,
 	    htons((short)ETH_P_ALL))) < 0) {
 		if (errno == ENOPROTOOPT || errno == EPROTONOSUPPORT ||
 		    errno == ESOCKTNOSUPPORT || errno == EPFNOSUPPORT ||
@@ -84,11 +103,16 @@ int if_register_lpf (info)
 		log_fatal ("Open a socket for LPF: %m");
 	}
 
+	memset (&ifr, 0, sizeof ifr);
+	strncpy (ifr.ifr_name, (const char *)info->ifp, sizeof ifr.ifr_name);
+	if (ioctl (sock, SIOCGIFINDEX, &ifr))
+		log_fatal ("Failed to get interface index: %m");
+
 	/* Bind
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Thu, Feb 24, 2011 at 04:08:22PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 16:45:37 Michael S. Tsirkin wrote: On Wed, Feb 23, 2011 at 02:59:04PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 08:19:21 Alex Williamson wrote: On Sun, 2011-01-30 at 13:11 +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Looks pretty good overall. A few comments below. It seems like we should be able to hook this into vfio with a small stub in kvm. We just need to be able to communicate disabling and enabling of individual msix vectors. For brute force, we could do this with a lot of eventfds, triggered by kvm and consumed by vfio, two per MSI-X vector. Not sure if there's something smaller that could do it. Thanks, Alex, thanks for your comments. See my comments below: Alex Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/Makefile|2 +- arch/x86/kvm/x86.c |8 +- include/linux/kvm.h | 21 include/linux/kvm_host.h | 25 virt/kvm/assigned-dev.c | 44 +++ virt/kvm/kvm_main.c | 38 ++- virt/kvm/msix_mmio.c | 286 ++ virt/kvm/msix_mmio.h | 25 8 files changed, 442 insertions(+), 7 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
	coalesced_mmio.o irq_comm.o eventfd.o \
-	assigned-dev.o)
+	assigned-dev.o msix_mmio.o)
kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o)
kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fa708c9..89bf12c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
 	case KVM_CAP_ASYNC_PF:
+	case KVM_CAP_MSIX_MMIO:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa;
+	int r;
 	gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception);
@@ -3822,14 +3824,16 @@ static int emulator_write_emulated_onepage(unsigned long addr,
 mmio:
 	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val);
+	r = vcpu_mmio_write(vcpu, gpa, bytes, val);
 	/*
 	 * Is this MMIO handled locally?
 	 */
-	if (!vcpu_mmio_write(vcpu, gpa, bytes, val))
+	if (!r)
 		return X86EMUL_CONTINUE;
 	vcpu->mmio_needed = 1;
-	vcpu->run->exit_reason = KVM_EXIT_MMIO;
+	vcpu->run->exit_reason = (r == -ENOTSYNC) ?
+		KVM_EXIT_MSIX_ROUTING_UPDATE : KVM_EXIT_MMIO;

This use of -ENOTSYNC is IMO confusing. How about we make vcpu_mmio_write return the positive exit reason? Negative value will mean an error.

Make sense. I would update it.
 	vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa;
 	vcpu->run->mmio.len = vcpu->mmio_size = bytes;
 	vcpu->run->mmio.is_write = vcpu->mmio_is_write = 1;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..ad9df4b 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -161,6 +161,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_NMI 16
 #define KVM_EXIT_INTERNAL_ERROR 17
 #define KVM_EXIT_OSI 18
+#define KVM_EXIT_MSIX_ROUTING_UPDATE 19
 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
@@ -541,6 +542,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_MSIX_MMIO 60
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -672,6 +674,9 @@ struct kvm_clock_data {
 #define KVM_XEN_HVM_CONFIG _IOW(KVMIO, 0x7a, struct kvm_xen_hvm_config
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Thu, Feb 24, 2011 at 05:44:20PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 16:45:37 Michael S. Tsirkin wrote: On Wed, Feb 23, 2011 at 02:59:04PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 08:19:21 Alex Williamson wrote: On Sun, 2011-01-30 at 13:11 +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Looks pretty good overall. A few comments below. It seems like we should be able to hook this into vfio with a small stub in kvm. We just need to be able to communicate disabling and enabling of individual msix vectors. For brute force, we could do this with a lot of eventfds, triggered by kvm and consumed by vfio, two per MSI-X vector. Not sure if there's something smaller that could do it. Thanks, Alex, thanks for your comments. See my comments below: Alex Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/Makefile|2 +- arch/x86/kvm/x86.c |8 +- include/linux/kvm.h | 21 include/linux/kvm_host.h | 25 virt/kvm/assigned-dev.c | 44 +++ virt/kvm/kvm_main.c | 38 ++- virt/kvm/msix_mmio.c | 286 ++ virt/kvm/msix_mmio.h | 25 8 files changed, 442 insertions(+), 7 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..89bf12c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu) { gpa_t gpa; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3822,14 +3824,16 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val); /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val)) + if (!r) return X86EMUL_CONTINUE; vcpu-mmio_needed = 1; - vcpu-run-exit_reason = KVM_EXIT_MMIO; + vcpu-run-exit_reason = (r == -ENOTSYNC) ? + KVM_EXIT_MSIX_ROUTING_UPDATE : KVM_EXIT_MMIO; This use of -ENOTSYNC is IMO confusing. How about we make vcpu_mmio_write return the positive exit reason? Negative value will mean an error. In fact currently nagative value means something more need to be done, the same as MMIO exit. So it would be if (!r) return X86EMUL_CONTINUE; vcpu-run-exit_reason = r; Now I think we can keep it, or update them all later. The way to do this would be 1. patch to return KVM_EXIT_MMIO on mmio 2. 
your patch that returns KVM_EXIT_MSIX_ROUTING_UPDATE on top -- regards Yang, Sheng -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] KVM: Add kvm_io_ext_data to IO handler
On Thu, Feb 24, 2011 at 05:51:03PM +0800, Sheng Yang wrote: Add a new parameter to IO writing handler, so that we can transfer information from IO handler to caller. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/i8254.c |6 -- arch/x86/kvm/i8259.c |3 ++- arch/x86/kvm/lapic.c |3 ++- arch/x86/kvm/x86.c| 13 - include/linux/kvm_host.h | 12 ++-- virt/kvm/coalesced_mmio.c |3 ++- virt/kvm/eventfd.c|2 +- virt/kvm/ioapic.c |2 +- virt/kvm/iodev.h |6 -- virt/kvm/kvm_main.c |4 ++-- 10 files changed, 36 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c index efad723..bd8f0c5 100644 --- a/arch/x86/kvm/i8254.c +++ b/arch/x86/kvm/i8254.c @@ -439,7 +439,8 @@ static inline int pit_in_range(gpa_t addr) } static int pit_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = dev_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; @@ -585,7 +586,8 @@ static int pit_ioport_read(struct kvm_io_device *this, } static int speaker_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = speaker_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 3cece05..96b1070 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -480,7 +480,8 @@ static inline struct kvm_pic *to_pic(struct kvm_io_device *dev) } static int picdev_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *val) + gpa_t addr, int len, const void *val, + struct kvm_io_ext_data *ext_data) { struct kvm_pic *s = to_pic(this); unsigned char data = *(unsigned char *)val; diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 93cf9d0..f413e9c 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -836,7 
+836,8 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) } static int apic_mmio_write(struct kvm_io_device *this, - gpa_t address, int len, const void *data) + gpa_t address, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_lapic *apic = to_lapic(this); unsigned int offset = address - apic-base_address; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..21b84e2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3571,13 +3571,14 @@ static void kvm_init_msr_list(void) } static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len, -const void *v) +const void *v, struct kvm_io_ext_data *ext_data) { if (vcpu-arch.apic - !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v)) + !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v, ext_data)) return 0; - return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, addr, len, v); + return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, + addr, len, v, ext_data); } static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v) @@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu) { gpa_t gpa; + struct kvm_io_ext_data ext_data; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3825,7 +3827,7 @@ mmio: /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val)) + if (!vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data)) return X86EMUL_CONTINUE; vcpu-mmio_needed = 1; @@ -3940,6 +3942,7 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd) { /* TODO: String I/O for in kernel device */ int r; + struct kvm_io_ext_data ext_data; if (vcpu-arch.pio.in) r = kvm_io_bus_read(vcpu-kvm, KVM_PIO_BUS, vcpu-arch.pio.port, @@ -3947,7 +3950,7 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd) else r = kvm_io_bus_write(vcpu-kvm, KVM_PIO_BUS, vcpu-arch.pio.port, vcpu-arch.pio.size, - pd); + pd, ext_data); return r; } diff
Re: [PATCH 3/4] KVM: Emulate MSI-X table in kernel
On Thu, Feb 24, 2011 at 05:51:04PM +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Signed-off-by: Sheng Yang sh...@linux.intel.com Doesn't look like all comments got addressed. E.g. gpa_t entry_base is still there and in reality you said it's a host virtual address so should be void __user *; And ENOTSYNC meaning 'MSIX' is pretty hacky. --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/Makefile |2 +- arch/x86/kvm/mmu.c |2 + arch/x86/kvm/x86.c | 40 - include/linux/kvm.h | 28 include/linux/kvm_host.h| 34 + virt/kvm/assigned-dev.c | 44 ++ virt/kvm/kvm_main.c | 38 +- virt/kvm/msix_mmio.c| 296 +++ virt/kvm/msix_mmio.h| 25 10 files changed, 497 insertions(+), 13 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index aa75f21..4a390a4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -635,6 +635,7 @@ enum emulation_result { EMULATE_DONE, /* no further processing */ EMULATE_DO_MMIO, /* kvm_run filled with mmio request */ EMULATE_FAIL, /* can't emulate this instruction */ + EMULATE_USERSPACE_EXIT, /* we need exit to userspace */ }; #define EMULTYPE_NO_DECODE (1 0) diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9cafbb4..912dca4 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3358,6 +3358,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code, case EMULATE_DO_MMIO: ++vcpu-stat.mmio_exits; /* fall through */ + case EMULATE_USERSPACE_EXIT: + /* fall through */ case EMULATE_FAIL: return 0; default: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 21b84e2..87308eb 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3809,6 +3810,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, { gpa_t gpa; struct kvm_io_ext_data ext_data; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3824,18 +3826,32 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data); /* * Is this MMIO handled locally? 
*/ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data)) + if (!r) return X86EMUL_CONTINUE; - vcpu-mmio_needed = 1; - vcpu-run-exit_reason = KVM_EXIT_MMIO; - vcpu-run-mmio.phys_addr = vcpu-mmio_phys_addr = gpa; - vcpu-run-mmio.len = vcpu-mmio_size = bytes; - vcpu-run-mmio.is_write = vcpu-mmio_is_write = 1; - memcpy(vcpu-run-mmio.data, val, bytes); + if (r == -ENOTSYNC) { + vcpu-userspace_exit_needed = 1; + vcpu-run-exit_reason = KVM_EXIT_MSIX_ROUTING_UPDATE; + vcpu-run-msix_routing.dev_id = + ext_data.msix_routing.dev_id; + vcpu-run-msix_routing.type = + ext_data.msix_routing.type; + vcpu-run-msix_routing.entry_idx = + ext_data.msix_routing.entry_idx; + vcpu-run-msix_routing.flags = + ext_data.msix_routing.flags; + } else { + vcpu-mmio_needed = 1; + vcpu-run-exit_reason = KVM_EXIT_MMIO; + vcpu-run-mmio.phys_addr = vcpu-mmio_phys_addr = gpa; + vcpu-run-mmio.len = vcpu-mmio_size = bytes; + vcpu-run-mmio.is_write = vcpu-mmio_is_write = 1; + memcpy(vcpu-run-mmio.data, val, bytes); + } return X86EMUL_CONTINUE; } @@ -4469,6 +4485,8 @@ done:
Re: [PATCH 2/4] KVM: Add kvm_io_ext_data to IO handler
On Fri, Feb 25, 2011 at 11:23:30AM +0800, Sheng Yang wrote: On Thursday 24 February 2011 18:22:19 Michael S. Tsirkin wrote: On Thu, Feb 24, 2011 at 05:51:03PM +0800, Sheng Yang wrote: Add a new parameter to IO writing handler, so that we can transfer information from IO handler to caller. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/i8254.c |6 -- arch/x86/kvm/i8259.c |3 ++- arch/x86/kvm/lapic.c |3 ++- arch/x86/kvm/x86.c| 13 - include/linux/kvm_host.h | 12 ++-- virt/kvm/coalesced_mmio.c |3 ++- virt/kvm/eventfd.c|2 +- virt/kvm/ioapic.c |2 +- virt/kvm/iodev.h |6 -- virt/kvm/kvm_main.c |4 ++-- 10 files changed, 36 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c index efad723..bd8f0c5 100644 --- a/arch/x86/kvm/i8254.c +++ b/arch/x86/kvm/i8254.c @@ -439,7 +439,8 @@ static inline int pit_in_range(gpa_t addr) } static int pit_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = dev_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; @@ -585,7 +586,8 @@ static int pit_ioport_read(struct kvm_io_device *this, } static int speaker_ioport_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *data) + gpa_t addr, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_pit *pit = speaker_to_pit(this); struct kvm_kpit_state *pit_state = pit-pit_state; diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 3cece05..96b1070 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -480,7 +480,8 @@ static inline struct kvm_pic *to_pic(struct kvm_io_device *dev) } static int picdev_write(struct kvm_io_device *this, - gpa_t addr, int len, const void *val) + gpa_t addr, int len, const void *val, + struct kvm_io_ext_data *ext_data) { struct kvm_pic *s = to_pic(this); unsigned char data = *(unsigned char *)val; diff --git 
a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 93cf9d0..f413e9c 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -836,7 +836,8 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) } static int apic_mmio_write(struct kvm_io_device *this, - gpa_t address, int len, const void *data) + gpa_t address, int len, const void *data, + struct kvm_io_ext_data *ext_data) { struct kvm_lapic *apic = to_lapic(this); unsigned int offset = address - apic-base_address; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..21b84e2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3571,13 +3571,14 @@ static void kvm_init_msr_list(void) } static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len, -const void *v) +const void *v, struct kvm_io_ext_data *ext_data) { if (vcpu-arch.apic - !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v)) + !kvm_iodevice_write(vcpu-arch.apic-dev, addr, len, v, ext_data)) return 0; - return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, addr, len, v); + return kvm_io_bus_write(vcpu-kvm, KVM_MMIO_BUS, + addr, len, v, ext_data); } static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v) @@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu) { gpa_t gpa; + struct kvm_io_ext_data ext_data; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3825,7 +3827,7 @@ mmio: /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val)) + if (!vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data)) return X86EMUL_CONTINUE; vcpu-mmio_needed = 1; @@ -3940,6 +3942,7 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd) { /* TODO: String I/O for in kernel device */ int r; + struct kvm_io_ext_data ext_data; if (vcpu-arch.pio.in) r = kvm_io_bus_read(vcpu-kvm, KVM_PIO_BUS, vcpu
Re: [PATCH 3/4] KVM: Emulate MSI-X table in kernel
On Fri, Feb 25, 2011 at 02:28:02PM +0800, Sheng Yang wrote: On Thursday 24 February 2011 18:45:08 Michael S. Tsirkin wrote: On Thu, Feb 24, 2011 at 05:51:04PM +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Signed-off-by: Sheng Yang sh...@linux.intel.com Doesn't look like all comments got addressed. E.g. gpa_t entry_base is still there and in reality you said it's a host virtual address so should be void __user *; Would update it. And ENOTSYNC meaning 'MSIX' is pretty hacky. I'd like to discuss it later. We may need some work on all MMIO handling side to make it more straightforward. But I don't want to bundle it with this one... It's not PCI related so I'll defer to Avi/Marcelo on this. Are you guys happy with the ENOTSYNC meaning 'MSIX' and userspace_exit_needed hacks in this code? --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/Makefile |2 +- arch/x86/kvm/mmu.c |2 + arch/x86/kvm/x86.c | 40 - include/linux/kvm.h | 28 include/linux/kvm_host.h| 34 + virt/kvm/assigned-dev.c | 44 ++ virt/kvm/kvm_main.c | 38 +- virt/kvm/msix_mmio.c| 296 +++ virt/kvm/msix_mmio.h | 25 10 files changed, 497 insertions(+), 13 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index aa75f21..4a390a4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -635,6 +635,7 @@ enum emulation_result { EMULATE_DONE, /* no further processing */ EMULATE_DO_MMIO, /* kvm_run filled with mmio request */ EMULATE_FAIL, /* can't emulate this instruction */ + EMULATE_USERSPACE_EXIT, /* we need exit to userspace */ }; #define EMULTYPE_NO_DECODE (1 0) diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9cafbb4..912dca4 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3358,6 +3358,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code, case EMULATE_DO_MMIO: ++vcpu-stat.mmio_exits; /* fall through */ + case EMULATE_USERSPACE_EXIT: + /* fall through */ case EMULATE_FAIL: return 0; default: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 21b84e2..87308eb 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3809,6 +3810,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, { gpa_t gpa; struct kvm_io_ext_data ext_data; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3824,18 +3826,32 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data); /* * Is this MMIO handled locally? 
*/ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val, ext_data)) + if (!r) return X86EMUL_CONTINUE; - vcpu-mmio_needed = 1; - vcpu-run-exit_reason = KVM_EXIT_MMIO; - vcpu-run-mmio.phys_addr = vcpu-mmio_phys_addr = gpa; - vcpu-run-mmio.len = vcpu-mmio_size = bytes; - vcpu-run-mmio.is_write = vcpu-mmio_is_write = 1; - memcpy(vcpu-run-mmio.data, val, bytes); + if (r == -ENOTSYNC) { + vcpu-userspace_exit_needed = 1; + vcpu-run-exit_reason = KVM_EXIT_MSIX_ROUTING_UPDATE; + vcpu-run-msix_routing.dev_id = + ext_data.msix_routing.dev_id; + vcpu-run-msix_routing.type = + ext_data.msix_routing.type; + vcpu-run
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Fri, Feb 25, 2011 at 02:12:42PM +0800, Sheng Yang wrote: On Thursday 24 February 2011 18:17:34 Michael S. Tsirkin wrote: On Thu, Feb 24, 2011 at 05:44:20PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 16:45:37 Michael S. Tsirkin wrote: On Wed, Feb 23, 2011 at 02:59:04PM +0800, Sheng Yang wrote: On Wednesday 23 February 2011 08:19:21 Alex Williamson wrote: On Sun, 2011-01-30 at 13:11 +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Looks pretty good overall. A few comments below. It seems like we should be able to hook this into vfio with a small stub in kvm. We just need to be able to communicate disabling and enabling of individual msix vectors. For brute force, we could do this with a lot of eventfds, triggered by kvm and consumed by vfio, two per MSI-X vector. Not sure if there's something smaller that could do it. Thanks, Alex, thanks for your comments. See my comments below: Alex Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/Makefile|2 +- arch/x86/kvm/x86.c |8 +- include/linux/kvm.h | 21 include/linux/kvm_host.h | 25 virt/kvm/assigned-dev.c | 44 +++ virt/kvm/kvm_main.c | 38 ++- virt/kvm/msix_mmio.c | 286 ++ virt/kvm/msix_mmio.h | 25 8 files changed, 442 insertions(+), 7 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. 
kvm-y+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fa708c9..89bf12c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3807,6 +3808,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, struct kvm_vcpu *vcpu) { gpa_t gpa; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3822,14 +3824,16 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val); /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val)) + if (!r) return X86EMUL_CONTINUE; vcpu-mmio_needed = 1; - vcpu-run-exit_reason = KVM_EXIT_MMIO; + vcpu-run-exit_reason = (r == -ENOTSYNC) ? + KVM_EXIT_MSIX_ROUTING_UPDATE : KVM_EXIT_MMIO; This use of -ENOTSYNC is IMO confusing. How about we make vcpu_mmio_write return the positive exit reason? Negative value will mean an error. In fact currently nagative value means something more need to be done, the same as MMIO exit. So it would be if (!r) return X86EMUL_CONTINUE; vcpu-run-exit_reason = r; Now I think we can keep it, or update them all later. The way to do this would be 1. patch to return KVM_EXIT_MMIO on mmio 2. your patch that returns KVM_EXIT_MSIX_ROUTING_UPDATE on top It's not that straightforward. In most condition, the reason vcpu_mmio_write() 0 because KVM itself unable to complete the request. That's quite straightforward. 
But each handler in the chain can't decide that it would be KVM_EXIT_MMIO; they only know when all of them fail to handle the access. Well, this just violates the standard API for return values. The standard semantics are: negative means error, positive means not an error. So you can just return
Re: Bug in kvm_set_irq
On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote: Hi, Each time i try to use vhost_net, i'm facing a kernel bug. I do a modprobe vhost_net, and start the guest with vhost=on. Following is a trace with kernel 2.6.37, but i had the same problem with 2.6.36 (cf https://lkml.org/lkml/2010/11/30/29). 2.6.36 had a theoretical race that could explain this, but it should be ok in 2.6.37. The bug only occurs with vhost_net loaded, so i don't know if this is a bug in the kvm module code or in the vhost_net code. It could be a bug in eventfd, which is the interface used by both kvm and vhost_net. Just for fun, you can try 2.6.38 - eventfd code has been changed a lot in 2.6.38, and if it does not trigger there it's a hint that irqfd is the reason.
Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243100] BUG: unable to handle kernel paging request at 2458
Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243250] IP: [a041aa8a] kvm_set_irq+0x2a/0x130 [kvm]
Could you run markup_oops/ksymoops on this please? As far as I can see, kvm_set_irq can only get a wrong kvm pointer, unless there's some general memory corruption, I'd guess. You can also try comparing the irqfd->kvm pointer in kvm_irqfd_assign, irqfd_wakeup and kvm_set_irq in virt/kvm/eventfd.c.
Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243378] PGD 45d363067 PUD 45e77a067 PMD 0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243556] Oops: [#1] SMP Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243692] last sysfs file: /sys/devices/pci:00/:00:0d.0/:05:00.0/:06:00.0/irq Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243777] CPU 0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243820] Modules linked in: vhost_net macvtap macvlan tun powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand fre q_table cpufreq_conservative fuse xt_physdev ip6t_LOG ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp nf_conntrack_ftp nf_connt rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore snd_page_alloc tpm_tis tpm ps mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses sd_mod enclosu re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2 scsi_mod ehci_hcd [last unloaded: scsi_wait_scan] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Pid: 10, comm: kworker/0:1 Not tainted 2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RIP: 0010:[a041aa8a] [a041aa8a] kvm_set_irq+0x2a/0x130 [kvm] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RSP: 0018:88045fc89d30 EFLAGS: 00010246 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RAX: RBX: 001a RCX: 0001 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RDX: RSI: RDI: Feb 23 13:56:19 
ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RBP: R08: 0001 R09: 880856a91e48 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] R10: R11: R12: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] R13: 0001 R14: R15: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] FS: 7f617986c710() GS:88007f80() knlGS: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] CS: 0010 DS: ES: CR0: 8005003b Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] CR2: 2458 CR3: 00045d197000 CR4: 06f0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] DR0: DR1: DR2: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] DR3: DR6: 0ff0 DR7: 0400 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Process kworker/0:1 (pid: 10, threadinfo 88045fc88000, task 88085fc53c30) Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Stack: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] 88045fc89fd8 000119c0 88045fc88010 88085fc53ee8 Feb 23 13:56:19
Re: [PATCH 0/3] [RFC] Implement multiqueue (RX TX) virtio-net
On Mon, Feb 28, 2011 at 12:04:27PM +0530, Krishna Kumar wrote: This patch series is a continuation of an earlier one that implemented guest MQ TX functionality. This new patchset implements both RX and TX MQ. Qemu changes are not being included at this time solely to aid in easier review. Compatibility testing with old/new combinations of qemu/guest and vhost was done without any issues. Some early TCP/UDP test results are at the bottom of this post, I plan to submit more test results in the coming days. Please review and provide feedback on what can improve. Thanks! Signed-off-by: Krishna Kumar krkum...@in.ibm.com To help testing, could you post the qemu changes separately please? --- Test configuration: Host: 8 Intel Xeon, 8 GB memory Guest: 4 cpus, 2 GB memory Each test case runs for 60 secs, results below are average over two runs. Bandwidth numbers are in gbps. I have used default netperf, and no testing/system tuning other than taskset each vhost to 0xf (cpus 0-3). Comparison is testing original kernel vs new kernel with #txqs=8 (# refers to number of netperf sessions). 
___________________________________________________________
TCP: Guest -> Local Host (TCP_STREAM)
#     BW1    BW2 (%)       SD1    SD2 (%)       RSD1   RSD2 (%)
___________________________________________________________
1     7190   7170 (-.2)    0      0 (0)         3      4 (33.3)
2     8774   11235 (28.0)  3      3 (0)         16     14 (-12.5)
4     9753   15195 (55.7)  17     21 (23.5)     65     59 (-9.2)
8     10224  18265 (78.6)  71     115 (61.9)    251    240 (-4.3)
16    10749  18123 (68.6)  277    456 (64.6)    985    925 (-6.0)
32    11133  17351 (55.8)  1132   1947 (71.9)   3935   3831 (-2.6)
64    11223  17115 (52.4)  4682   7836 (67.3)   15949  15373 (-3.6)
128   11269  16986 (50.7)  19783  31505 (59.2)  66799  61759 (-7.5)
___________________________________________________________
Summary: BW: 37.6  SD: 61.2  RSD: -6.5

___________________________________________________________
TCP: Local Host -> Guest (TCP_MAERTS)
#     BW1    BW2 (%)       SD1    SD2 (%)       RSD1   RSD2 (%)
___________________________________________________________
1     11490  10870 (-5.3)  0      0 (0)         2      2 (0)
2     10612  10554 (-.5)   2      3 (50.0)      12     12 (0)
4     10047  14320 (42.5)  13     16 (23.0)     53     53 (0)
8     9273   15182 (63.7)  56     84 (50.0)     228    233 (2.1)
16    9291   15853 (70.6)  235    390 (65.9)    934    965 (3.3)
32    9382   15741 (67.7)  969    1823 (88.1)   3868   4037 (4.3)
64    9270   14585 (57.3)  3966   8836 (122.7)  15415  17818 (15.5)
128   8997   14678 (63.1)  17024  36649 (115.2) 64933  72677 (11.9)
___________________________________________________________
SUM: BW: 24.8  SD: 114.6  RSD: 12.1

___________________________________________________________
UDP: Local Host -> Guest (UDP_STREAM)
#     BW1    BW2 (%)       SD1    SD2 (%)
___________________________________________________________
1     17236  16585 (-3.7)  1      1 (0)
2     16795  22693 (35.1)  5      6 (20.0)
4     13390  21502 (60.5)  37     36 (-2.7)
8     13261  24361 (83.7)  163    175 (7.3)
16    12772  23796 (86.3)  692    826 (19.3)
32    12832  23880 (86.0)  2812   2871 (2.0)
64    12779  24293 (90.1)  11299  11237 (-.5)
128   13006  24857 (91.1)  44778  43884 (-1.9)
___________________________________________________________
Summary: BW: 37.1  SD: -1.2
Re: [PATCH 2/3] [RFC] Changes for MQ virtio-net
On Mon, Feb 28, 2011 at 12:04:37PM +0530, Krishna Kumar wrote: Implement mq virtio-net driver. Though struct virtio_net_config changes, it works with the old qemu since the last element is not accessed unless qemu sets VIRTIO_NET_F_NUMTXQS. Patch also adds a macro for the maximum number of TX vq's (VIRTIO_MAX_TXQS) that the user can specify. Signed-off-by: Krishna Kumar krkum...@in.ibm.com Overall looks good. The numtxqs meaning the number of rx queues needs some cleanup. init/cleanup routines need more symmetry. Error handling on setup also seems slightly buggy or at least asymmetrical. Finally, this will use up a large number of MSI vectors, while TX interrupts mostly stay unused. Some comments below. --- drivers/net/virtio_net.c | 543 --- include/linux/virtio_net.h |6 2 files changed, 386 insertions(+), 163 deletions(-) diff -ruNp org/include/linux/virtio_net.h new/include/linux/virtio_net.h --- org/include/linux/virtio_net.h2010-10-11 10:20:22.0 +0530 +++ new/include/linux/virtio_net.h2011-02-25 16:24:15.0 +0530 @@ -7,6 +7,9 @@ #include linux/virtio_config.h #include linux/if_ether.h +/* Maximum number of individual RX/TX queues supported */ +#define VIRTIO_MAX_TXQS 16 + This also does not seem to belong in the header. /* The feature bitmap for virtio net */ #define VIRTIO_NET_F_CSUM0 /* Host handles pkts w/ partial csum */ #define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */ @@ -26,6 +29,7 @@ #define VIRTIO_NET_F_CTRL_RX 18 /* Control channel RX mode support */ #define VIRTIO_NET_F_CTRL_VLAN 19 /* Control channel VLAN filtering */ #define VIRTIO_NET_F_CTRL_RX_EXTRA 20/* Extra RX mode control support */ +#define VIRTIO_NET_F_NUMTXQS 21 /* Device supports multiple TX queue */ VIRTIO_NET_F_MULTIQUEUE ? 
#define VIRTIO_NET_S_LINK_UP 1 /* Link is up */ @@ -34,6 +38,8 @@ struct virtio_net_config { __u8 mac[6]; /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */ __u16 status; + /* number of RX/TX queues */ + __u16 numtxqs; The interface here is a bit ugly: - this is really both # of tx and rx queues but called numtxqs - there's a hardcoded max value - 0 is assumed to be same as 1 - assumptions above are undocumented. One way to address this could be num_queue_pairs, and something like /* The actual number of TX and RX queues is num_queue_pairs + 1 each. */ __u16 num_queue_pairs; (and tweak code to match). Alternatively, have separate registers for the number of tx and rx queues. } __attribute__((packed)); /* This is the first element of the scatter-gather list. If you don't diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c --- org/drivers/net/virtio_net.c 2011-02-21 17:55:42.0 +0530 +++ new/drivers/net/virtio_net.c 2011-02-25 16:23:41.0 +0530 @@ -40,31 +40,53 @@ module_param(gso, bool, 0444); #define VIRTNET_SEND_COMMAND_SG_MAX2 -struct virtnet_info { - struct virtio_device *vdev; - struct virtqueue *rvq, *svq, *cvq; - struct net_device *dev; +/* Internal representation of a send virtqueue */ +struct send_queue { + struct virtqueue *svq; + + /* TX: fragments + linear part + virtio header */ + struct scatterlist tx_sg[MAX_SKB_FRAGS + 2]; +}; + +/* Internal representation of a receive virtqueue */ +struct receive_queue { + /* Virtqueue associated with this receive_queue */ + struct virtqueue *rvq; + + /* Back pointer to the virtnet_info */ + struct virtnet_info *vi; + struct napi_struct napi; - unsigned int status; /* Number of input buffers, and max we've ever had. */ unsigned int num, max; - /* I like... big packets and I cannot lie! */ - bool big_packets; - - /* Host will merge rx buffers for big packets (shake it! shake it!) */ - bool mergeable_rx_bufs; - /* Work struct for refilling if we run low on memory. 
*/ struct delayed_work refill; /* Chain pages by the private ptr. */ struct page *pages; - /* fragments + linear part + virtio header */ + /* RX: fragments + linear part + virtio header */ struct scatterlist rx_sg[MAX_SKB_FRAGS + 2]; - struct scatterlist tx_sg[MAX_SKB_FRAGS + 2]; +}; + +struct virtnet_info { + struct send_queue **sq; + struct receive_queue **rq; + + /* read-mostly variables */ + int numtxqs ____cacheline_aligned_in_smp; /* # of rxqs/txqs */ Why do you think this alignment is a win? + struct virtio_device *vdev; + struct virtqueue *cvq; + struct net_device *dev; + unsigned int status; + + /* I like... big packets and I cannot lie! */ + bool big_packets; + + /* Host will merge rx buffers for big packets (shake
Re: Bug in kvm_set_irq
On Mon, Feb 28, 2011 at 09:56:46AM +0100, Jean-Philippe Menil wrote: On 27/02/2011 18:00, Michael S. Tsirkin wrote: On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote: Hi, Each time I try to use vhost_net, I'm facing a kernel bug. I do a modprobe vhost_net, and start the guest with vhost=on. Following is a trace with kernel 2.6.37, but I had the same problem with 2.6.36 (cf https://lkml.org/lkml/2010/11/30/29). 2.6.36 had a theoretical race that could explain this, but it should be ok in 2.6.37. The bug only occurs with vhost_net loaded, so I don't know if this is a bug in the kvm module code or in the vhost_net code. It could be a bug in eventfd, which is the interface used by both kvm and vhost_net. Just for fun, you can try 2.6.38 - eventfd code has been changed a lot in 2.6.38, and if it does not trigger there it's a hint that irqfd is the reason. Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243100] BUG: unable to handle kernel paging request at 2458 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243250] IP: [a041aa8a] kvm_set_irq+0x2a/0x130 [kvm] Could you run markup_oops/ksymoops on this please? As far as I can see kvm_set_irq can only get a wrong kvm pointer. Unless there's some general memory corruption, I'd guess. You can also try comparing the irqfd->kvm pointer in kvm_irqfd_assign, irqfd_wakeup and kvm_set_irq in virt/kvm/eventfd.c.
Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243378] PGD 45d363067 PUD 45e77a067 PMD 0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243556] Oops: [#1] SMP Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243692] last sysfs file: /sys/devices/pci:00/:00:0d.0/:05:00.0/:06:00.0/irq Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243777] CPU 0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243820] Modules linked in: vhost_net macvtap macvlan tun powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand fre q_table cpufreq_conservative fuse xt_physdev ip6t_LOG ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp nf_conntrack_ftp nf_connt rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore snd_page_alloc tpm_tis tpm ps mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses sd_mod enclosu re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2 scsi_mod ehci_hcd [last unloaded: scsi_wait_scan] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Pid: 10, comm: kworker/0:1 Not tainted 2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RIP: 0010:[a041aa8a] [a041aa8a] kvm_set_irq+0x2a/0x130 [kvm] Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RSP: 0018:88045fc89d30 EFLAGS: 00010246 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RAX: RBX: 001a RCX: 0001 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RDX: RSI: RDI: Feb 23 13:56:19 
ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] RBP: R08: 0001 R09: 880856a91e48 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] R10: R11: R12: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] R13: 0001 R14: R15: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] FS: 7f617986c710() GS:88007f80() knlGS: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] CS: 0010 DS: ES: CR0: 8005003b Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] CR2: 2458 CR3: 00045d197000 CR4: 06f0 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] DR0: DR1: DR2: Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] DR3: DR6: 0ff0 DR7: 0400 Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Process kworker/0:1 (pid: 10, threadinfo 88045fc88000, task 88085fc53c30) Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Stack: Feb 23 13:56:19 ayrshire.u06.univ
Re: [PATCH 3/4] KVM: Emulate MSI-X table in kernel
On Mon, Feb 28, 2011 at 03:20:04PM +0800, Sheng Yang wrote: Then we can support mask bit operation of assigned devices now. Signed-off-by: Sheng Yang sh...@linux.intel.com A general question: we implement mmio read and write operations here, but seem to do nothing about ordering. In particular, pci mmio reads are normally assumed to flush all writes out to the device and all device writes in to the CPU. How exactly does it work here? --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/Makefile | 2 +- arch/x86/kvm/mmu.c | 2 + arch/x86/kvm/x86.c | 40 - include/linux/kvm.h | 28 include/linux/kvm_host.h | 36 + virt/kvm/assigned-dev.c | 44 ++ virt/kvm/kvm_main.c | 38 +- virt/kvm/msix_mmio.c | 301 +++ virt/kvm/msix_mmio.h | 25 10 files changed, 504 insertions(+), 13 deletions(-) create mode 100644 virt/kvm/msix_mmio.c create mode 100644 virt/kvm/msix_mmio.h diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index aa75f21..4a390a4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -635,6 +635,7 @@ enum emulation_result { EMULATE_DONE, /* no further processing */ EMULATE_DO_MMIO, /* kvm_run filled with mmio request */ EMULATE_FAIL, /* can't emulate this instruction */ + EMULATE_USERSPACE_EXIT, /* we need exit to userspace */ }; #define EMULTYPE_NO_DECODE (1 << 0) diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..3a0d851 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I.
kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o msix_mmio.o) kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9cafbb4..912dca4 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3358,6 +3358,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code, case EMULATE_DO_MMIO: ++vcpu->stat.mmio_exits; /* fall through */ + case EMULATE_USERSPACE_EXIT: + /* fall through */ case EMULATE_FAIL: return 0; default: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 21b84e2..87308eb 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_MSIX_MMIO: r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3809,6 +3810,7 @@ static int emulator_write_emulated_onepage(unsigned long addr, { gpa_t gpa; struct kvm_io_ext_data ext_data; + int r; gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception); @@ -3824,18 +3826,32 @@ static int emulator_write_emulated_onepage(unsigned long addr, mmio: trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val); + r = vcpu_mmio_write(vcpu, gpa, bytes, val, &ext_data); /* * Is this MMIO handled locally? */ - if (!vcpu_mmio_write(vcpu, gpa, bytes, val, &ext_data)) + if (!r) return X86EMUL_CONTINUE; - vcpu->mmio_needed = 1; - vcpu->run->exit_reason = KVM_EXIT_MMIO; - vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa; - vcpu->run->mmio.len = vcpu->mmio_size = bytes; - vcpu->run->mmio.is_write = vcpu->mmio_is_write = 1; - memcpy(vcpu->run->mmio.data, val, bytes); + if (r == -ENOTSYNC) { Can you replace -ENOTSYNC with KVM_EXIT_MSIX_ROUTING_UPDATE all over please?
+ vcpu->userspace_exit_needed = 1; + vcpu->run->exit_reason = KVM_EXIT_MSIX_ROUTING_UPDATE; + vcpu->run->msix_routing.dev_id = + ext_data.msix_routing.dev_id; + vcpu->run->msix_routing.type = + ext_data.msix_routing.type; + vcpu->run->msix_routing.entry_idx = + ext_data.msix_routing.entry_idx; + vcpu->run->msix_routing.flags = + ext_data.msix_routing.flags; + } else { + vcpu->mmio_needed = 1; + vcpu->run->exit_reason = KVM_EXIT_MMIO; + vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa; + vcpu->run->mmio.len = vcpu->mmio_size = bytes; + vcpu->run->mmio.is_write = vcpu->mmio_is_write = 1; +