Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
On 6/25/2012 2:16 AM, Jason Wang wrote: Hello All: This series is an updated version of the multiqueue virtio-net driver based on Krishna Kumar's work to let virtio-net use multiple rx/tx queues for packet reception and transmission. Please review and comment.

Test Environment:
- Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores, 2 NUMA nodes
- Two directly connected 82599s

Test Summary:
- Highlights: huge improvements in the TCP_RR test
- Lowlights: regression on small-packet transmission, higher cpu utilization than single queue, needs further optimization

Does this also scale with an increased number of VMs?

Thanks Sridhar

Analysis of the performance results:
- I counted the number of packets sent/received during the test, and multiqueue shows much higher capability in terms of packets per second.
- For the tx regression, multiqueue sends about 1-2 times more packets than single queue, and the packets are much smaller than the ones single queue sends. I suspect tcp does less batching with multiqueue, so I hacked tcp_write_xmit() to force more batching; multiqueue then works as well as single queue for both small-packet transmission and throughput.
- I didn't include the accelerated RFS support for virtio-net in this series as it still needs further shaping; those interested please see: http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
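The tx-regression analysis above (multiqueue moving more, smaller packets at similar byte throughput) can be made concrete from byte/packet counters. A minimal sketch with made-up numbers chosen to illustrate the reported pattern, not the actual test data:

```python
def avg_packet_size(tx_bytes, tx_packets):
    """Average packet size in bytes over a test run."""
    return tx_bytes / tx_packets

# Hypothetical counters: multiqueue moves ~2x the packets of single
# queue for the same byte count, so the average packet is roughly half
# the size -- consistent with TCP doing less batching per flow.
sq_bytes, sq_pkts = 10_000_000_000, 1_600_000
mq_bytes, mq_pkts = 10_000_000_000, 3_200_000

print(avg_packet_size(sq_bytes, sq_pkts))  # 6250.0
print(avg_packet_size(mq_bytes, mq_pkts))  # 3125.0
```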
Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware
On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote: Propagate the software FDB table into the hardware uc, mc lists when NETIF_F_HW_FDB is set. This resolves the case below where an embedded switch is used in hardware to do inter-VF or VF-PF switching. This patch pushes the FDB entry (specifically the MAC address) into the embedded switch with dev_uc_add and dev_mc_add so the switch learns about the software bridge.

  veth0    veth2
    |        |
  +------------+
  |  bridge0   |        software bridging
  +------------+
      /    /
  ethx.y  ethx
    VF     PF
      \     \           propagate FDB entries to HW
       \     \
  +-----------------+
  | Embedded Bridge |   hardware offloaded switching
  +-----------------+

This scenario works now, as adding an interface to a bridge puts it in promiscuous mode. So adding a PF to a software bridge should not be a problem, as it supports promiscuous mode. But adding a VF will not work. Are you trying to avoid the requirement of having to put the interface in promiscuous mode when adding it to a bridge? Thanks Sridhar

This is only an RFC; a couple more changes are needed. (1) Optimize the HW FDB set/del to only walk the list if an FDB-offloaded device is attached, or decide it doesn't matter given the unlikely() path. (2) Is it good enough to just call dev_uc_{add|del} or dev_mc_{add|del}? Or do some devices really need a new netdev callback to do this operation correctly? I think it should be good enough as is. (3) Wrapped the list walk in rcu_read_lock() just in case; maybe every caller is already inside rcu_read_lock()/unlock(). Also this is in response to the thread regarding macvlan and exposing rx filters; posting now to see if folks think this is the right idea and if it will resolve at least the bridge case.
http://lists.openwall.net/netdev/2011/11/08/135

Signed-off-by: John Fastabend john.r.fastab...@intel.com
---
 include/linux/netdev_features.h |    2 ++
 net/bridge/br_fdb.c             |   34 ++
 2 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 77f5202..5936fae 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -55,6 +55,8 @@ enum {
 	NETIF_F_NOCACHE_COPY_BIT,	/* Use no-cache copyfromuser */
 	NETIF_F_LOOPBACK_BIT,		/* Enable loopback */

+	NETIF_F_HW_FDB,			/* Hardware supports switching */
+
 	/*
 	 * Add your fresh new feature above and remember to update
 	 * netdev_features_strings[] in net/core/ethtool.c and maybe
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 5ba0c84..4cc545b 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -81,9 +81,26 @@ static void fdb_rcu_free(struct rcu_head *head)
 	kmem_cache_free(br_fdb_cache, ent);
 }

+static void fdb_hw_delete(struct net_bridge *br,
+			  struct net_bridge_fdb_entry *fdb)
+{
+	struct net_bridge_port *op;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(op, &br->port_list, list) {
+		struct net_device *dev = op->dev;
+
+		if ((dev->features & NETIF_F_HW_FDB) &&
+		    dev != fdb->dst->dev)
+			dev_uc_del(dev, fdb->addr.addr);
+	}
+	rcu_read_unlock();
+}
+
 static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f)
 {
 	hlist_del_rcu(&f->hlist);
+	fdb_hw_delete(br, f);
 	fdb_notify(br, f, RTM_DELNEIGH);
 	call_rcu(&f->rcu, fdb_rcu_free);
 }
@@ -350,6 +367,22 @@ static struct net_bridge_fdb_entry *fdb_find_rcu(struct hlist_head *head,
 	return NULL;
 }

+static void fdb_hw_create(struct net_bridge *br,
+			  struct net_bridge_fdb_entry *fdb)
+{
+	struct net_bridge_port *op;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(op, &br->port_list, list) {
+		struct net_device *dev = op->dev;
+
+		if ((dev->features & NETIF_F_HW_FDB) &&
+		    dev != fdb->dst->dev)
+			dev_uc_add(dev, fdb->addr.addr);
+	}
+	rcu_read_unlock();
+}
+
 static struct net_bridge_fdb_entry
*fdb_create(struct hlist_head *head,
	    struct net_bridge_port *source,
	    const unsigned char *addr)
@@ -363,6 +396,7 @@ static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
 		fdb->is_local = 0;
 		fdb->is_static = 0;
 		fdb->updated = fdb->used = jiffies;
+		fdb_hw_create(source->br, fdb);
 		hlist_add_head_rcu(&fdb->hlist, head);
 	}
 	return fdb;
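The intent of fdb_hw_create()/fdb_hw_delete() in the patch above can be modeled in a few lines: when an FDB entry is learned on one bridge port, its MAC is pushed to the unicast filter list of every other port whose device advertises hardware FDB support. A hypothetical userspace model of that logic, not kernel code:

```python
class Port:
    def __init__(self, name, hw_fdb=False):
        self.name = name
        self.hw_fdb = hw_fdb   # models dev->features & NETIF_F_HW_FDB
        self.uc_list = set()   # models the device unicast address list

class Bridge:
    def __init__(self, ports):
        self.ports = ports
        self.fdb = {}          # mac -> port the address was learned on

    def fdb_create(self, mac, source):
        """Mirror fdb_hw_create(): program every other HW-FDB port."""
        self.fdb[mac] = source
        for p in self.ports:
            if p.hw_fdb and p is not source:
                p.uc_list.add(mac)

    def fdb_delete(self, mac):
        """Mirror fdb_hw_delete(): remove the address again."""
        source = self.fdb.pop(mac)
        for p in self.ports:
            if p.hw_fdb and p is not source:
                p.uc_list.discard(mac)

veth0 = Port("veth0")
pf = Port("ethx", hw_fdb=True)   # embedded-switch uplink
br = Bridge([veth0, pf])
br.fdb_create("52:54:00:12:34:56", veth0)
print(pf.uc_list)  # {'52:54:00:12:34:56'}
```

So the embedded switch learns the software-only address via the port's unicast list, which is the effect the patch is after.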
Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware
On Thu, 2012-02-09 at 12:30 -0800, John Fastabend wrote: On 2/9/2012 10:14 AM, Sridhar Samudrala wrote: On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote: Propagate the software FDB table into the hardware uc, mc lists when NETIF_F_HW_FDB is set. This resolves the case below where an embedded switch is used in hardware to do inter-VF or VF-PF switching. This patch pushes the FDB entry (specifically the MAC address) into the embedded switch with dev_uc_add and dev_mc_add so the switch learns about the software bridge.

[diagram as in the original patch: veth0/veth2 on bridge0 (software bridging) over VF/PF, with FDB entries propagated to the Embedded Bridge (hardware offloaded switching)]

This scenario works now, as adding an interface to a bridge puts it in promiscuous mode. So adding a PF to a software bridge should not be a problem, as it supports promiscuous mode. But adding a VF will not work. It shouldn't work, because the embedded bridge will look up the address in its FDB, and when it doesn't match any unicast filters it will forward the packet onto the wire. Because veth0 and veth2 above never get inserted into the embedded bridge's FDB, the packets will _never_ get routed there. That said, the current 'ixgbe' driver is doing something broken in that it is always setting the unicast hash table and mirroring bits to 1. So if you think this is working, you're seeing a bug where packets are being sent onto the wire AND up to the PF. Packets with destination addresses matching veth1 should not end up on the wire, and vice versa. This is specific to ixgbe and is not the case for other SR-IOV devices. OK. Is this behavior going to be fixed? This causes some issues: (a) it has some very real performance implications, and (b) at this point you have some strange behavior from my point of view. The embedded bridge is not a learning bridge, nor is it acting like an 802.1Q VEB or VEPA. Are you trying to avoid the requirement of having to put the interface in promiscuous mode when adding to a bridge?
I think the bridge being in promiscuous mode is correct. The interface that is added to the bridge is put in promiscuous mode, not the bridge itself. In this example, I assumed that setting promiscuous on the PF is putting the embedded bridge in learning mode. Thanks Sridhar Hope that helps; sorry it's a bit long-winded. John
Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
On 12/7/2011 3:02 AM, Jason Wang wrote: On 12/06/2011 11:42 PM, Sridhar Samudrala wrote: On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote: On Tue, Dec 6, 2011 at 10:21 AM, Jason Wangjasow...@redhat.com wrote: On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote: On Tue, Dec 6, 2011 at 6:33 AM, Jason Wangjasow...@redhat.com wrote: On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote: On Mon, Dec 5, 2011 at 8:59 AM, Jason Wangjasow...@redhat.com wrote: The vcpus are just threads and may not be bound to physical CPUs, so what is the big picture here? Is the guest even in the position to set the best queue mappings today? Not sure it could publish the best mapping but the idea is to make sure the packets of a flow were handled by the same guest vcpu and may be the same vhost thread in order to eliminate the packet reordering and lock contention. But this assumption does not take the bouncing of vhost or vcpu threads which would also affect the result. Okay, this is why I'd like to know what the big picture here is. What solution are you proposing? How are we going to have everything from guest application, guest kernel, host threads, and host NIC driver play along so we get the right steering up the entire stack. I think there needs to be an answer to that before changing virtio-net to add any steering mechanism. Yes. Also the current model of a vhost thread per VM's interface doesn't help with packet steering all the way from the guest to the host physical NIC. I think we need to have vhost thread(s) per-CPU that can handle packets to/from physical NIC's TX/RX queues. Currently we have a single vhost thread for a VM's i/f that handles all the packets from various flows coming from a multi-queue physical NIC. Even if we have per-cpu workthread, only one socket is used to queue the packet then, so a multiple queue(sockets) tap/macvtap is still needed. I think so. We need per-cpu tap/macvtap sockets along with per-cpu vhost threads. 
This will parallelize all the way from physical NIC to vhost. Thanks Sridhar
Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote: On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang jasow...@redhat.com wrote: On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote: On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang jasow...@redhat.com wrote: On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote: On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang jasow...@redhat.com wrote: The vcpus are just threads and may not be bound to physical CPUs, so what is the big picture here? Is the guest even in the position to set the best queue mappings today? Not sure it could publish the best mapping but the idea is to make sure the packets of a flow were handled by the same guest vcpu and may be the same vhost thread in order to eliminate the packet reordering and lock contention. But this assumption does not take the bouncing of vhost or vcpu threads which would also affect the result. Okay, this is why I'd like to know what the big picture here is. What solution are you proposing? How are we going to have everything from guest application, guest kernel, host threads, and host NIC driver play along so we get the right steering up the entire stack. I think there needs to be an answer to that before changing virtio-net to add any steering mechanism. Yes. Also the current model of a vhost thread per VM's interface doesn't help with packet steering all the way from the guest to the host physical NIC. I think we need to have vhost thread(s) per-CPU that can handle packets to/from physical NIC's TX/RX queues. Currently we have a single vhost thread for a VM's i/f that handles all the packets from various flows coming from a multi-queue physical NIC. Thanks Sridhar
Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
On 12/6/2011 8:14 AM, Michael S. Tsirkin wrote: On Tue, Dec 06, 2011 at 07:42:54AM -0800, Sridhar Samudrala wrote: On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote: On Tue, Dec 6, 2011 at 10:21 AM, Jason Wangjasow...@redhat.com wrote: On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote: On Tue, Dec 6, 2011 at 6:33 AM, Jason Wangjasow...@redhat.com wrote: On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote: On Mon, Dec 5, 2011 at 8:59 AM, Jason Wangjasow...@redhat.com wrote: The vcpus are just threads and may not be bound to physical CPUs, so what is the big picture here? Is the guest even in the position to set the best queue mappings today? Not sure it could publish the best mapping but the idea is to make sure the packets of a flow were handled by the same guest vcpu and may be the same vhost thread in order to eliminate the packet reordering and lock contention. But this assumption does not take the bouncing of vhost or vcpu threads which would also affect the result. Okay, this is why I'd like to know what the big picture here is. What solution are you proposing? How are we going to have everything from guest application, guest kernel, host threads, and host NIC driver play along so we get the right steering up the entire stack. I think there needs to be an answer to that before changing virtio-net to add any steering mechanism. Yes. Also the current model of a vhost thread per VM's interface doesn't help with packet steering all the way from the guest to the host physical NIC. I think we need to have vhost thread(s) per-CPU that can handle packets to/from physical NIC's TX/RX queues. Currently we have a single vhost thread for a VM's i/f that handles all the packets from various flows coming from a multi-queue physical NIC. Thanks Sridhar It's not hard to try that: 1. revert c23f3445e68e1db0e74099f264bc5ff5d55ebdeb this will convert our thread to a workqueue 2. convert the workqueue to a per-cpu one It didn't work that well in the past, but YMMV Yes. 
I tried this before we went ahead with the per-interface vhost threading model. At that time, per-cpu vhost showed a regression with a single VM, and per-vq vhost showed good performance improvements up to 8 VMs. So just making it per-cpu would not be enough. I think we may need a way to schedule vcpu threads on the same cpu-socket as vhost. Another aspect we need to look into is the splitting of the vhost thread into separate threads for TX and RX. Shirley is doing some work in this area, and she is seeing perf. improvements as long as the TX and RX threads are on the same cpu-socket. On the surface I'd say a single thread makes some sense as long as the guest uses a single queue. But this may not be scalable long term when we want to support a large number of VMs, each having multiple virtio-net interfaces with multiple queues. Thanks Sridhar
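The threading trade-off discussed above can be made concrete with a toy dispatcher: a per-interface model funnels every flow of a VM's interface through one worker, while a per-CPU model spreads flows across workers by hash (at the cost of a queue/socket per worker). This is an illustrative sketch only; the real vhost uses kernel threads, not anything like this:

```python
def per_interface_worker(flow_hash, n_workers):
    # one vhost thread per VM interface: every flow lands on worker 0
    return 0

def per_cpu_worker(flow_hash, n_workers):
    # per-CPU vhost: flows spread across workers by flow hash
    return flow_hash % n_workers

# 64 hypothetical TCP flows (4-tuples) from one guest interface
flows = [hash(("10.0.0.1", 5000 + i, "10.0.0.2", 80)) for i in range(64)]

print(len({per_interface_worker(f, 8) for f in flows}))  # 1
print(len({per_cpu_worker(f, 8) for f in flows}))        # spreads, up to 8
```

The second model parallelizes, but as the thread notes, it only helps end to end if the tap/macvtap side also has a socket per worker.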
Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode
On 11/30/2011 3:00 PM, Chris Wright wrote: * Ben Hutchings (bhutchi...@solarflare.com) wrote: On Wed, 2011-11-30 at 13:04 -0800, Chris Wright wrote: I agree that it's confusing. Couldn't you simplify your ascii art (hopefully removing hw assumptions about receive processing, and completely ignoring vlans for the moment) to something like:

[garbled ASCII diagram: an "RX MAC filter and port select" table at the physical port, steering match 0 to the PF, match 1 to VF 1, and match 2 to VF 2]

And there's an unclear number of ways to update the RX MAC filter and port select table.

1) PF ndo_set_mac_addr
I expect that to be implicit to match 0.
2) PF ndo_set_rx_mode
Less clear, but I'd still expect these to implicitly match 0.
3) PF ndo_set_vf_mac
I expect these to be an explicit match to VF N (given the interface specifies which VF's MAC is being programmed). I'm not sure whether this is supposed to implicitly add to the MAC filter or whether that has to be changed too. That's the main difference between my models (a) and (b). I see now. I wasn't entirely clear on the difference before. It's also going to be hw specific. I think (Intel folks can verify) that the Intel SR-IOV devices have a single global unicast exact match table, for example. There's also PF ndo_set_vf_vlan. Right, although I had mentioned I was trying to limit just to MAC filtering to simplify.
4) VF ndo_set_mac_addr
This one may or may not be allowed (setting MAC+port if the VF is owned by a guest is likely not allowed), but I would expect an implicit VF N.
5) VF ndo_set_rx_mode
Same as 4) above. So this is where we are today. Cool, good that we agree there.
6) PF or VF? ndo_set_rx_filter_addr
The new proposal, which has an explicit VF, although when it's VF_SELF I'm not clear if this is just the same as 5) above?

Have I missed anything? Any physical port can be bridged to a mixture of guests with and without their own VFs.
Packets sent from a guest with a VF to the address of a guest without a VF need to be forwarded to the PF rather than the physical port, but none of the drivers currently get to know about those addresses. To clarify, do you mean something like this?

[ASCII diagram: physical port -> VEB; VEB -> PF, VF 1 and VF 2; PF -> sw switch (linux bridge) -> VM1, VM2, VM3; VF 1 -> VM4 (PCI device assignment); VF 2 -> macvtap -> VM5]

This has VMs 1-3 hanging off the PF via a linux bridge (traditional hv switching), VM4 directly owning VF1 (pci device assignment), and VM5 indirectly owning VF2 (macvtap passthrough, that started this whole thing). So, I understand you as saying that VM4 or VM5 sending a packet to VM1 goes into the VEB, out the PF, and into the linux bridging code, right? At which point the PF is in promiscuous mode (btw, the same does not work if the bridge is attached to a VF, at least for some VFs, due to lack of promiscuous mode). Packets sent from a guest with a VF to the address of another guest with a VF need to be forwarded similarly, but the driver should be able to infer that from (3). Right, and that works currently for the case where both guests are like VM4, directly owning the VF via PCI device assignment. But for VM4 to talk to VM5, VF3 is not in promiscuous mode and has a different MAC address than VM5's vNIC. If the embedded bridge does not learn, and nobody programmed it to fwd frames for VM5 via VF3... I think you are referring to VF2. There is no VF3 in your picture. In macvtap passthru mode, VF2 will be set to the same mac address as VM5's MAC. So VM4 should be able to talk to VM5. I believe this is what Roopa's patch will allow. The question now is whether there's a better way to handle this? My understanding is that Roopa's patch will allow setting additional mac addresses to VM5 without the need to put VF5 in promiscuous mode. Thanks Sridhar In my mind, we'd model the NIC's embedded bridge as, well, a bridge.
And set anti-spoofing, port mirroring, port mac/vlan filtering, etc. via that bridge. thanks, -chris
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On 9/11/2011 6:18 AM, Roopa Prabhu wrote: On 9/11/11 2:44 AM, Michael S. Tsirkin m...@redhat.com wrote: AFAIK, though it might maintain a single filter table space in hw, hw does know which filter belongs to which VF. And the OS driver does not need to do anything special. The VF driver exposes a VF netdev. And any uc/mc addresses registered with a VF netdev are registered with the hw by the driver. And hw will filter and send only pkts that the VF has expressed interest in. No special filter partitioning in hw is required. Thanks, Roopa Yes, but what I mean is, if the size of the single filter table is limited, we need to decide how many addresses each guest is allowed. If we let one guest ask for as many as it wants, it can lock others out. Yes, true. In these cases, i.e. when the number of unicast addresses being registered is more than it can handle, the VF driver will put the VF in promiscuous mode (or at least it's supposed to; I think all drivers do that). What does putting a VF in promiscuous mode mean? How can the NIC decide which set of mac addresses are passed to the VF? Does it mean the VF sees all the packets received by the NIC, including packets destined for other VFs/PF? Thanks Sridhar
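The fallback Roopa describes (a VF driver flipping to promiscuous mode when more unicast addresses are registered than the hardware filter can hold) can be sketched as a small model. Capacity and behavior here are illustrative assumptions, not taken from any specific driver:

```python
class VfFilter:
    def __init__(self, capacity):
        self.capacity = capacity   # hw exact-match unicast slots for this VF
        self.uc_addrs = set()
        self.promisc = False       # models the unicast-promiscuous fallback

    def add_uc(self, mac):
        self.uc_addrs.add(mac)
        # too many addresses for exact-match filtering: accept everything
        self.promisc = len(self.uc_addrs) > self.capacity

    def accepts(self, mac):
        return self.promisc or mac in self.uc_addrs

vf = VfFilter(capacity=2)
for m in ["aa:00:00:00:00:01", "aa:00:00:00:00:02"]:
    vf.add_uc(m)
print(vf.promisc)                       # False: table still fits
vf.add_uc("aa:00:00:00:00:03")
print(vf.promisc)                       # True: overflow, filter in sw now
print(vf.accepts("bb:00:00:00:00:99"))  # True: promiscuous passes everything
```

This also illustrates Sridhar's concern: once the VF is promiscuous (to whatever extent the NIC allows that per VF), address filtering has effectively moved into software.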
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On Thu, 2011-09-08 at 09:19 -0700, Roopa Prabhu wrote: On 9/8/11 4:08 AM, Michael S. Tsirkin m...@redhat.com wrote: On Wed, Sep 07, 2011 at 10:20:28PM -0700, Roopa Prabhu wrote: On 9/7/11 5:34 AM, Michael S. Tsirkin m...@redhat.com wrote: On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote: This patch is an attempt at providing address filtering support for macvtap devices in PASSTHRU mode. It's still a work in progress. Briefly tested for basic functionality. Wanted to get some feedback on the direction before proceeding. Good work, thanks. Thanks. I have hopefully CC'ed all concerned people. kvm crowd might also be interested. Try using ./scripts/get_maintainer.pl as well. Thanks for the tip. Expanded the CC list a bit more. PASSTHRU mode today sets the lowerdev in promiscuous mode. In PASSTHRU mode there is a 1-1 mapping between the macvtap device and the physical nic or VF. And all filtering is done in the lowerdev hw. The lowerdev does not need to be in promiscuous mode as long as the guest filters are passed down to the lowerdev. This patch tries to remove the need for putting the lowerdev in promiscuous mode. I have also referred to the thread below where TUNSETTXFILTER was mentioned in this context: http://patchwork.ozlabs.org/patch/69297/ This patch basically passes the addresses received via TUNSETTXFILTER to the macvlan lowerdev. I have looked at previous work and discussions on this for qemu-kvm by Michael Tsirkin, Alex Williamson and Dragos Tatulea: http://patchwork.ozlabs.org/patch/78595/ http://patchwork.ozlabs.org/patch/47160/ https://patchwork.kernel.org/patch/474481/ Redhat bugzilla by Michael Tsirkin: https://bugzilla.redhat.com/show_bug.cgi?id=655013 I used Michael's qemu-kvm patch for testing the changes with KVM. I would like to cover both MAC and vlan filtering in this work. Open Questions/Issues: - There is a need for vlan filtering to complete the patch. It will require a new tap ioctl cmd for vlans.
Some ideas on this are: a) TUNSETVLANFILTER: This will entail sending the whole vlan bitmap filter (similar to tun_filter for addresses). Passing the vlan ids to the lower device will mean going through the whole list of vlans every time. OR b) TUNSETVLAN with a vlan id and a flag to set/unset. Does option 'b' sound ok? - In this implementation we make the macvlan address list the same as the address list that came in the filter with TUNSETTXFILTER. This will not cover cases where the macvlan device needs to have other addresses that are not necessarily in the filter. Is this a problem? What cases do you have in mind? This patch targets only macvlan PASSTHRU mode, and for PASSTHRU mode I don't see a problem with the uc/mc address list being the same in all the stacked netdevs in the path. I called that out above to make sure I was not missing any case in PASSTHRU mode where this might be invalid. Otherwise I don't see a problem in the simple PASSTHRU use case this patch supports. - The patch currently only supports passing of the IFF_PROMISC and IFF_MULTICAST filter flags to the lowerdev. This patch series implements the following: 01/3 - macvlan: Add support for unicast filtering in macvlan 02/3 - macvlan: Add function to set addr filter on lower device in passthru mode 03/3 - macvtap: Add support for TUNSETTXFILTER Please comment. Thanks. Signed-off-by: Roopa Prabhu ropra...@cisco.com Signed-off-by: Christian Benvenuti be...@cisco.com Signed-off-by: David Wang dwa...@cisco.com The security isn't lower than with promisc, so I don't see a problem with this as such. There are more features we'll want down the road though, so let's see whether the interface will be able to satisfy them in a backwards-compatible way before we set it in stone. Here's what I came up with: How will the filtering table be partitioned within guests?
Since this patch supports macvlan PASSTHRU mode only, in which the lower device has a 1-1 mapping to the guest nic, it does not require any partitioning of the filtering table within guests. Unless I misunderstood something. If the lower device were being shared by multiple guest network interfaces (non PASSTHRU mode), only then would we need to maintain separate filter tables for each guest network interface in macvlan and forward the pkt to the respective guest interface after a filter lookup. This could affect performance too, I think. Not with hardware filtering support. Which is where we'd need to partition the host nic mac table between guests. I need to understand this more. In the non passthru case, when a VF or physical nic is shared between guests, the nic does not really know about the guests, so I was thinking we do the same thing as we do for the passthru case (ie send all the
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On 9/8/2011 8:00 PM, Roopa Prabhu wrote: On 9/8/11 12:33 PM, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Sep 08, 2011 at 12:23:56PM -0700, Roopa Prabhu wrote: I think the main usecase for passthru mode is to assign an SR-IOV VF to a single guest. Yes, and for the passthru usecase this patch should be enough to enable filtering in hw (eventually, like I indicated before, I need to fix vlan filtering too). So with filtering in hw, and in the sriov VF case, VFs actually share a filtering table. How will that be partitioned? AFAIK, though it might maintain a single filter table space in hw, hw does know which filter belongs to which VF. And the OS driver does not need to do anything special. The VF driver exposes a VF netdev. And any uc/mc addresses registered with a VF netdev are registered with the hw by the driver. And hw will filter and send only pkts that the VF has expressed interest in. Does your NIC driver support adding multiple mac addresses to a VF? I have tried a few other SR-IOV NICs some time back and they didn't support this feature. Currently, we don't have an interface to add multiple mac addresses to a netdev other than the indirect way of creating a macvlan i/f on top of it. Thanks Sridhar No special filter partitioning in hw is required. Thanks, Roopa
Re: [net-next RFC PATCH 0/7] multiqueue support for tun/tap
On Fri, 2011-08-12 at 09:54 +0800, Jason Wang wrote: As multi-queue nics are commonly used in high-end servers, the current single-queue-based tap cannot satisfy the requirement of scaling guest network performance as the number of vcpus increases. So the following series implements multiple queue support in tun/tap. In order to take advantage of this, a multi-queue capable driver and qemu are also needed. I just rebased the latest version of Krishna's multi-queue virtio-net driver into this series to simplify testing. And for multiqueue-capable qemu, you can refer to the patches I posted in http://www.spinics.net/lists/kvm/msg52808.html. Vhost is also a must to achieve high performance, and its code can be used for multi-queue without modification. Alternatively, this series can also be used for Krishna's M:N implementation of multiqueue, but I didn't test it. The idea is simple: each socket is abstracted as a queue for tun/tap, and userspace may open as many files as required and then attach them to the devices. In order to keep ABI compatibility, device creation is still done in TUNSETIFF, and two new ioctls TUNATTACHQUEUE and TUNDETACHQUEUE were added for userspace to manipulate the number of queues for the tun/tap. Is it possible to have tap create these queues automatically when TUNSETIFF is called, instead of having userspace do the new ioctls? I am just wondering if it is possible to get multi-queue enabled without any changes to qemu. I guess the number of queues could be based on the number of vhost threads/guest virtio-net queues. Also, is it possible to enable multi-queue on the host alone, without any guest virtio-net changes? Have you done any multiple TCP_RR/UDP_RR testing with small packet sizes? 256-byte request/response with 50-100 instances? I've done some basic performance testing of multi-queue tap. For tun, I just tested it through vpnc.
Notes:
- Tests show improvement when receiving packets from a local/external host to the guest, and when sending big packets from the guest to a local/external host.
- The current multiqueue-based virtio-net/tap introduces a regression when sending small packets (512 bytes) from the guest to a local/external host. I suspect it's an issue of queue selection in both the guest driver and tap. I will continue to investigate.
- I will post the performance numbers as a reply to this mail.

TODO:
- solve the issue of small-packet transmission
- address the comments on the virtio-net driver
- performance tuning

Please review and comment it. Thanks.
---
Jason Wang (5):
      tuntap: move socket/sock related structures to tun_file
      tuntap: categorize ioctl
      tuntap: introduce multiqueue related flags
      tuntap: multiqueue support
      tuntap: add ioctls to attach or detach a file from tap device

Krishna Kumar (2):
      Change virtqueue structure
      virtio-net changes

 drivers/net/tun.c           |  738 ++-
 drivers/net/virtio_net.c    |  578 --
 drivers/virtio/virtio_pci.c |   10 -
 include/linux/if_tun.h      |    5
 include/linux/virtio.h      |    1
 include/linux/virtio_net.h  |    3
 6 files changed, 867 insertions(+), 468 deletions(-)
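The queue-selection issue flagged in the notes comes down to mapping each flow to one queue consistently, so a flow's packets are never reordered across queues. A toy steering table illustrating that property; this is a hypothetical sketch, not the RFC's actual selection algorithm:

```python
class FlowSteering:
    def __init__(self, n_queues):
        self.n_queues = n_queues
        self.table = {}   # flow hash -> queue, filled on first transmit

    def select_queue(self, flow_hash):
        # pin each flow to one queue; a stable mapping avoids reordering,
        # but a poor hash distribution can underuse queues (one suspected
        # cause of the small-packet regression)
        if flow_hash not in self.table:
            self.table[flow_hash] = flow_hash % self.n_queues
        return self.table[flow_hash]

fs = FlowSteering(n_queues=4)
q = fs.select_queue(0xdeadbeef)
# the same flow always maps to the same queue
assert all(fs.select_queue(0xdeadbeef) == q for _ in range(100))
print(q)  # 3
```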
Re: [PATCHv2 RFC 3/4] virtio_net: limit xmit polling
On Thu, 2011-06-02 at 18:43 +0300, Michael S. Tsirkin wrote: Current code might introduce a lot of latency variation if there are many pending bufs at the time we attempt to transmit a new one. This is bad for real-time applications and can't be good for TCP either. Free up just enough to both clean up all buffers eventually and to be able to xmit the next packet.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/net/virtio_net.c |  106 +-
 1 files changed, 67 insertions(+), 39 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index a0ee78d..b25db1c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -509,17 +509,33 @@ again:
 	return received;
 }

-static void free_old_xmit_skbs(struct virtnet_info *vi)
+static bool free_old_xmit_skb(struct virtnet_info *vi)
 {
 	struct sk_buff *skb;
 	unsigned int len;

-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
-		pr_debug("Sent skb %p\n", skb);
-		vi->dev->stats.tx_bytes += skb->len;
-		vi->dev->stats.tx_packets++;
-		dev_kfree_skb_any(skb);
-	}
+	skb = virtqueue_get_buf(vi->svq, &len);
+	if (unlikely(!skb))
+		return false;
+	pr_debug("Sent skb %p\n", skb);
+	vi->dev->stats.tx_bytes += skb->len;
+	vi->dev->stats.tx_packets++;
+	dev_kfree_skb_any(skb);
+	return true;
+}
+
+/* Check capacity and try to free enough pending old buffers to enable queueing
+ * new ones. Return true if we can guarantee that a following
+ * virtqueue_add_buf will succeed. */
+static bool free_xmit_capacity(struct virtnet_info *vi)
+{
+	struct sk_buff *skb;
+	unsigned int len;
+
+	while (virtqueue_min_capacity(vi->svq) < MAX_SKB_FRAGS + 2)
+		if (unlikely(!free_old_xmit_skb(vi)))
+			return false;

If we are using INDIRECT descriptors, 1 descriptor entry is good enough to guarantee that an skb can be queued unless we run out of memory. Is it worth checking if 'indirect' is set on the svq and then only freeing 1 descriptor?
Otherwise, we will be dropping the packet if there are fewer than 18 free descriptors although we only need 1. + return true; } static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb) @@ -572,30 +588,34 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb) static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) { struct virtnet_info *vi = netdev_priv(dev); - int capacity; - - /* Free up any pending old buffers before queueing new ones. */ - free_old_xmit_skbs(vi); - - /* Try to transmit */ - capacity = xmit_skb(vi, skb); - - /* This can happen with OOM and indirect buffers. */ - if (unlikely(capacity < 0)) { - if (net_ratelimit()) { - if (likely(capacity == -ENOMEM)) { - dev_warn(&dev->dev, - "TX queue failure: out of memory\n"); - } else { - dev->stats.tx_fifo_errors++; + int ret, i; + + /* We normally do have space in the ring, so try to queue the skb as + * fast as possible. */ + ret = xmit_skb(vi, skb); + if (unlikely(ret < 0)) { + /* This triggers on the first xmit after ring full condition. + * We need to free up some skbs first. */ + if (likely(free_xmit_capacity(vi))) { + ret = xmit_skb(vi, skb); + /* This should never fail. Check, just in case. */ + if (unlikely(ret < 0)) { dev_warn(&dev->dev, "Unexpected TX queue failure: %d\n", - capacity); + ret); + dev->stats.tx_fifo_errors++; + dev->stats.tx_dropped++; + kfree_skb(skb); + return NETDEV_TX_OK; } + } else { + /* Ring full: it might happen if we get a callback while + * the queue is still mostly full. This should be + * extremely rare. */ + dev->stats.tx_dropped++; + kfree_skb(skb); + goto stop; } - dev->stats.tx_dropped++; - kfree_skb(skb); - return NETDEV_TX_OK; } virtqueue_kick(vi->svq); @@ -603,18 +623,26 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) skb_orphan(skb); nf_reset(skb); - /* Apparently nice girls don't return TX_BUSY; stop the queue - * before it gets out of hand. Naturally, this wastes
Re: [PATCHv2 dontapply] vhost-net tx tuning
On Tue, 2011-02-01 at 17:52 +0200, Michael S. Tsirkin wrote: OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By default with it we will only interrupt when we see an empty ring. Which is likely too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value like half ring * 200, or packets some small value etc. Set any one parameter to 0 to get current behaviour (interrupt immediately when enabled). Warning: completely untested. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index aac05bc..6769cdc 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -32,6 +32,13 @@ * Using this limit prevents one virtqueue from starving others. */ #define VHOST_NET_WEIGHT 0x8 +int tx_bytes_coalesce = 10; +module_param(tx_bytes_coalesce, int, 0644); +int tx_bufs_coalesce = 10; +module_param(tx_bufs_coalesce, int, 0644); +int tx_packets_coalesce = 10; +module_param(tx_packets_coalesce, int, 0644); + enum { VHOST_NET_VQ_RX = 0, VHOST_NET_VQ_TX = 1, @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net) int err, wmem; size_t hdr_size; struct socket *sock; + int bytes_coalesced = 0; + int bufs_coalesced = 0; + int packets_coalesced = 0; /* TODO: check that we are running from vhost_worker?
*/ sock = rcu_dereference_check(vq-private_data, 1); @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net) if (err != len) pr_debug(Truncated TX packet: len %d != %zd\n, err, len); - vhost_add_used_and_signal(net-dev, vq, head, 0); total_len += len; + packets_coalesced += 1; + bytes_coalesced += len; + bufs_coalesced += out; + if (unlikely(packets_coalesced tx_packets_coalesce || + bytes_coalesced tx_bytes_coalesce || + bufs_coalesced tx_bufs_coalesce)) + vhost_add_used_and_signal(net-dev, vq, head, 0); I think the counters that exceed the limits need to be reset to 0 here. Otherwise we keep signaling for every buffer once we hit this condition. Thanks Sridhar + else + vhost_add_used(vq, head, 0); if (unlikely(total_len = VHOST_NET_WEIGHT)) { vhost_poll_queue(vq-poll); break; } } + if (likely(packets_coalesced +bytes_coalesced +bufs_coalesced)) + vhost_signal(net-dev, vq); mutex_unlock(vq-mutex); } -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote: Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM: OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value etc. Warning: completely untested. diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index aac05bc..6769cdc 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -32,6 +32,13 @@ * Using this limit prevents one virtqueue from starving others. */ #define VHOST_NET_WEIGHT 0x8 +int tx_bytes_coalesce = 0; +module_param(tx_bytes_coalesce, int, 0644); +int tx_bufs_coalesce = 0; +module_param(tx_bufs_coalesce, int, 0644); +int tx_packets_coalesce = 0; +module_param(tx_packets_coalesce, int, 0644); + enum { VHOST_NET_VQ_RX = 0, VHOST_NET_VQ_TX = 1, @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net) int err, wmem; size_t hdr_size; struct socket *sock; + int bytes_coalesced = 0; + int bufs_coalesced = 0; + int packets_coalesced = 0; /* TODO: check that we are running from vhost_worker? */ sock = rcu_dereference_check(vq-private_data, 1); @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net) if (err != len) pr_debug(Truncated TX packet: len %d != %zd\n, err, len); - vhost_add_used_and_signal(net-dev, vq, head, 0); total_len += len; + packets_coalesced += 1; + bytes_coalesced += len; + bufs_coalesced += in; Should this instead be: bufs_coalesced += out; Perusing the code I see that earlier there is a check to see if in is not zero, and, if so, error out of the loop. 
After the check, in is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below. Yes. It definitely should be 'out'. 'in' should be 0 in the tx path. I tried a simpler version of this patch without any tunables by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for small message guest to host stream test and the throughput went up a little. diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 9b3ca10..5f9fae9 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net) if (err != len) pr_debug(Truncated TX packet: len %d != %zd\n, err, len); - vhost_add_used_and_signal(net-dev, vq, head, 0); + vhost_add_used(vq, head, 0); total_len += len; if (unlikely(total_len = VHOST_NET_WEIGHT)) { vhost_poll_queue(vq-poll); @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net) } } + if (total_len 0) + vhost_signal(net-dev, vq); mutex_unlock(vq-mutex); } Or am I missing something? + if (unlikely(packets_coalesced tx_packets_coalesce || + bytes_coalesced tx_bytes_coalesce || + bufs_coalesced tx_bufs_coalesce)) + vhost_add_used_and_signal(net-dev, vq, head, 0); + else + vhost_add_used(vq, head, 0); if (unlikely(total_len = VHOST_NET_WEIGHT)) { vhost_poll_queue(vq-poll); break; } } + if (likely(packets_coalesced tx_packets_coalesce || + bytes_coalesced tx_bytes_coalesce || + bufs_coalesced tx_bufs_coalesce)) + vhost_signal(net-dev, vq); mutex_unlock(vq-mutex); } It is possible that we can miss signaling the guest even after processing a few pkts, if we don't hit any of these conditions. Steve D. 
Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments? Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED? -Sridhar hw/vhost.c |4 +++- hw/virtio-net.c |6 -- hw/virtio-pci.c |3 +++ hw/virtio.h |2 ++ 4 files changed, 12 insertions(+), 3 deletions(-) diff --git a/hw/vhost.c b/hw/vhost.c index 1d09ed0..c79765a 100644 --- a/hw/vhost.c +++ b/hw/vhost.c @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev) r = vdev-binding-set_guest_notifiers(vdev-binding_opaque, true); if (r 0) { -fprintf(stderr, Error binding guest notifier: %d\n, -r); + if (r != -EVIRTIO_DISABLED) { + fprintf(stderr, Error binding guest notifier: %d\n, -r); + } goto fail_notifiers; } diff --git a/hw/virtio-net.c b/hw/virtio-net.c index ccb3e63..5de3fee 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status) if (!n-vhost_started) { int r = vhost_net_start(tap_get_vhost_net(n-nic-nc.peer), n-vdev); if (r 0) { -error_report(unable to start vhost net: %d: - falling back on userspace virtio, -r); +if (r != -EVIRTIO_DISABLED) { +error_report(unable to start vhost net: %d: + falling back on userspace virtio, -r); +} } else { n-vhost_started = 1; } diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index dd8887a..dbf4be0 100644 --- 
a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign) EventNotifier *notifier = virtio_queue_get_guest_notifier(vq); if (assign) { +if (!msix_enabled(proxy-pci_dev)) { +return -EVIRTIO_DISABLED; +} int r = event_notifier_init(notifier, 0); if (r 0) { return r; diff --git a/hw/virtio.h b/hw/virtio.h index d8546d5..53bbdba 100644 --- a/hw/virtio.h +++ b/hw/virtio.h @@ -98,6 +98,8 @@ typedef struct { void (*vmstate_change)(void * opaque, bool running); } VirtIOBindings; +#define EVIRTIO_DISABLED ERANGE + #define VIRTIO_PCI_QUEUE_MAX 64 #define VIRTIO_NO_VECTOR 0x -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, 2011-01-20 at 19:47 +0200, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote: On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments? Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED? -Sridhar The error is reported by virtio-pci which does not know about vhost. I started with EVIRTIO_MSIX_DISABLED and made is shorter. Would EVIRTIO_MSIX_DISABLED be better? I think so. This makes it more clear. 
-Sridhar hw/vhost.c |4 +++- hw/virtio-net.c |6 -- hw/virtio-pci.c |3 +++ hw/virtio.h |2 ++ 4 files changed, 12 insertions(+), 3 deletions(-) diff --git a/hw/vhost.c b/hw/vhost.c index 1d09ed0..c79765a 100644 --- a/hw/vhost.c +++ b/hw/vhost.c @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev) r = vdev-binding-set_guest_notifiers(vdev-binding_opaque, true); if (r 0) { -fprintf(stderr, Error binding guest notifier: %d\n, -r); + if (r != -EVIRTIO_DISABLED) { + fprintf(stderr, Error binding guest notifier: %d\n, -r); + } goto fail_notifiers; } diff --git a/hw/virtio-net.c b/hw/virtio-net.c index ccb3e63..5de3fee 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status) if (!n-vhost_started) { int r = vhost_net_start(tap_get_vhost_net(n-nic-nc.peer), n-vdev); if (r 0) { -error_report(unable to start vhost net: %d: - falling back on userspace virtio, -r); +if (r != -EVIRTIO_DISABLED) { +error_report(unable to start vhost net: %d: + falling back on userspace virtio, -r); +} } else { n-vhost_started = 1; } diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index dd8887a..dbf4be0 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign) EventNotifier *notifier = virtio_queue_get_guest_notifier(vq); if (assign) { +if (!msix_enabled(proxy-pci_dev)) { +return -EVIRTIO_DISABLED; +} int r = event_notifier_init(notifier, 0); if (r 0) { return r; diff --git a/hw/virtio.h b/hw/virtio.h index d8546d5..53bbdba 100644 --- a/hw/virtio.h +++ b/hw/virtio.h @@ -98,6 +98,8 @@ typedef struct { void (*vmstate_change)(void * opaque, bool running); } VirtIOBindings; +#define EVIRTIO_DISABLED ERANGE + #define VIRTIO_PCI_QUEUE_MAX 64 #define VIRTIO_NO_VECTOR 0x -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at 
Re: Performance test result between per-vhost kthread disable and enable
On 11/23/2010 5:41 AM, Michael S. Tsirkin wrote: On Tue, Nov 23, 2010 at 09:23:41PM +0800, lidong chen wrote: At this point, I'd suggest testing vhost-net on the upstream kernel, not on rhel kernels. The change that introduced per-device threads is: c23f3445e68e1db0e74099f264bc5ff5d55ebdeb i will try this tomorrow. Is CONFIG_SCHED_DEBUG set? yes. CONFIG_SCHED_DEBUG=y. Disable it. Either debug scheduler or perf-test it :) Another debug option to disable is CONFIG_WORKQUEUE_TRACER if it is set when using old rhel6 kernels. -Sridhar 2010/11/23 Michael S. Tsirkinm...@redhat.com: On Tue, Nov 23, 2010 at 10:13:43AM +0800, lidong chen wrote: I test the performance between per-vhost kthread disable and enable. Test method: Send the same traffic load between per-vhost kthread disable and enable, and compare the cpu rate of host os. I run five vm on kvm, each of them have five nic. the vhost version which per-vhost kthread disable we used is rhel6 beta 2(2.6.32.60). the vhost version which per-vhost kthread enable we used is rhel6 (2.6.32-71). At this point, I'd suggest testing vhost-net on the upstream kernel, not on rhel kernels. The change that introduced per-device threads is: c23f3445e68e1db0e74099f264bc5ff5d55ebdeb Test result: with per-vhost kthread disable, the cpu rate of host os is 110%. with per-vhost kthread enable, the cpu rate of host os is 130%. Is CONFIG_SCHED_DEBUG set? We are stressing the scheduler a lot with vhost-net. In 2.6.32.60,the whole system only have a kthread. [r...@rhel6-kvm1 ~]# ps -ef | grep vhost root 973 2 0 Nov22 ?00:00:00 [vhost] In 2.6.32.71,the whole system have 25 kthread. 
[r...@kvm-4slot ~]# ps -ef | grep vhost- root 12896 2 0 10:26 ?00:00:00 [vhost-12842] root 12897 2 0 10:26 ?00:00:00 [vhost-12842] root 12898 2 0 10:26 ?00:00:00 [vhost-12842] root 12899 2 0 10:26 ?00:00:00 [vhost-12842] root 12900 2 0 10:26 ?00:00:00 [vhost-12842] root 13022 2 0 10:26 ?00:00:00 [vhost-12981] root 13023 2 0 10:26 ?00:00:00 [vhost-12981] root 13024 2 0 10:26 ?00:00:00 [vhost-12981] root 13025 2 0 10:26 ?00:00:00 [vhost-12981] root 13026 2 0 10:26 ?00:00:00 [vhost-12981] root 13146 2 0 10:26 ?00:00:00 [vhost-13088] root 13147 2 0 10:26 ?00:00:00 [vhost-13088] root 13148 2 0 10:26 ?00:00:00 [vhost-13088] root 13149 2 0 10:26 ?00:00:00 [vhost-13088] root 13150 2 0 10:26 ?00:00:00 [vhost-13088] ... Code difference: In 2.6.32.60,in function vhost_init, create the kthread for vhost. vhost_workqueue = create_singlethread_workqueue(vhost); In 2.6.32.71,in function vhost_dev_set_owner, create the kthread for each nic interface. dev-wq = create_singlethread_workqueue(vhost_name); Conclusion: with per-vhost kthread enable, the system can more throughput. but deal the same traffic load with per-vhost kthread enable, it waste more cpu resource. In my application scene, the cpu resource is more important, and one kthread for deal with traffic load is enough. So i think we should add a param to control this. for the CPU-bound system, this param disable per-vhost kthread. for the I/O-bound system, this param enable per-vhost kthread. the default value of this param is enable. If my opinion is right, i will give a patch for this. Let's try to figure out what the issue is, first. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] macvlan: Introduce 'passthru' mode to takeover the underlying device
On 10/29/2010 6:45 AM, Arnd Bergmann wrote: On Friday 29 October 2010, Sridhar Samudrala wrote: With the current default 'vepa' mode, a KVM guest using virtio with macvtap backend has the following limitations. - cannot change/add a mac address on the guest virtio-net I believe this could be changed if there is a need, but I actually consider it one of the design points of macvlan that the guest is not able to change the mac address. With 802.1Qbg you rely on the switch being able to identify the guest by its MAC address, which the host kernel must ensure. Currently the host cannot prevent a guest user from trying to change/add a mac address on the guest virtio-net. From the guest's point of view, the request succeeds, but the incoming packets are dropped silently by the host interface. - cannot create a vlan device on the guest virtio-net Why not? If this doesn't work, it's probably a bug! Because the host is not aware of the guest vlan tag and the host external interface will filter out incoming vlan-tagged packets. Why does the passthru mode enable it if it doesn't work already? passthru mode puts the host external interface in promiscuous mode, which allows vlan-tagged packets to be received. Even in tap/bridge mode, this works because adding an external interface to the bridge causes it to be put in promiscuous mode. Thanks Sridhar
Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation
On Thu, 2010-10-28 at 13:13 -0700, Shirley Ma wrote: On Thu, 2010-10-28 at 12:32 -0700, Shirley Ma wrote: Also I found a big TX regression between old guest and new guest. For the old guest, I am able to get almost 11Gb/s for 2K message size, but for the new guest kernel, I can only get 3.5 Gb/s with the patch and the same host. I will dig into why. The regression is from the guest kernel, not from this patch. I tested the 2.6.31 kernel; its performance is already less than 2Gb/s for 2K message size. I will resubmit the patch for review. I will start testing from the 2.6.30 kernel to figure out when the TX regression was introduced in virtio_net. Any suggestion on which guest kernel I should test to figure out this regression? It would be some change in the virtio-net driver that may have improved the latency of small messages, which in turn would have reduced the bandwidth as TCP could not accumulate and send large packets. Thanks Sridhar
[PATCH iproute2] Support 'mode' parameter when creating macvtap device
Add support for 'mode' parameter when creating a macvtap device. This allows a macvtap device to be created in bridge, private or the default vepa modes. Signed-off-by: Sridhar Samudrala s...@us.ibm.com --- diff --git a/ip/Makefile b/ip/Makefile index 2f223ca..6054e8a 100644 --- a/ip/Makefile +++ b/ip/Makefile @@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \ ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \ ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \ iplink_vlan.o link_veth.o link_gre.o iplink_can.o \ -iplink_macvlan.o +iplink_macvlan.o iplink_macvtap.o RTMONOBJ=rtmon.o diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c new file mode 100644 index 000..35199b1 --- /dev/null +++ b/ip/iplink_macvtap.c @@ -0,0 +1,90 @@ +/* + * iplink_macvtap.cmacvtap device support + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#include stdio.h +#include stdlib.h +#include string.h +#include sys/socket.h +#include linux/if_link.h + +#include rt_names.h +#include utils.h +#include ip_common.h + +static void explain(void) +{ + fprintf(stderr, + Usage: ... 
macvtap mode { private | vepa | bridge }\n + ); +} + +static int mode_arg(void) +{ +fprintf(stderr, Error: argument of \mode\ must be \private\, + \vepa\ or \bridge\\n); +return -1; +} + +static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv, + struct nlmsghdr *n) +{ + while (argc 0) { + if (matches(*argv, mode) == 0) { + __u32 mode = 0; + NEXT_ARG(); + + if (strcmp(*argv, private) == 0) + mode = MACVLAN_MODE_PRIVATE; + else if (strcmp(*argv, vepa) == 0) + mode = MACVLAN_MODE_VEPA; + else if (strcmp(*argv, bridge) == 0) + mode = MACVLAN_MODE_BRIDGE; + else + return mode_arg(); + + addattr32(n, 1024, IFLA_MACVLAN_MODE, mode); + } else if (matches(*argv, help) == 0) { + explain(); + return -1; + } else { + fprintf(stderr, macvtap: what is \%s\?\n, *argv); + explain(); + return -1; + } + argc--, argv++; + } + + return 0; +} + +static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]) +{ + __u32 mode; + + if (!tb) + return; + + if (!tb[IFLA_MACVLAN_MODE] || + RTA_PAYLOAD(tb[IFLA_MACVLAN_MODE]) sizeof(__u32)) + return; + + mode = *(__u32 *)RTA_DATA(tb[IFLA_VLAN_ID]); + fprintf(f, mode %s , + mode == MACVLAN_MODE_PRIVATE ? private + : mode == MACVLAN_MODE_VEPA? vepa + : mode == MACVLAN_MODE_BRIDGE ? bridge + :unknown); +} + +struct link_util macvtap_link_util = { + .id = macvtap, + .maxattr= IFLA_MACVLAN_MAX, + .parse_opt = macvtap_parse_opt, + .print_opt = macvtap_print_opt, +}; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH iproute2] macvlan/macvtap: support 'passthru' mode
Add support for 'passthru' mode when creating a macvlan/macvtap device which allows takeover of the underlying device and passing it to a KVM guest using virtio with macvtap backend. Only one macvlan device is allowed in passthru mode and it inherits the mac address from the underlying device and sets it in promiscuous mode to receive and forward all the packets. Signed-off-by: Sridhar Samudrala s...@us.ibm.com --- diff --git a/include/linux/if_link.h b/include/linux/if_link.h index f5bb2dc..23de79e 100644 --- a/include/linux/if_link.h +++ b/include/linux/if_link.h @@ -230,6 +230,7 @@ enum macvlan_mode { MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */ MACVLAN_MODE_VEPA= 2, /* talk to other ports through ext bridge */ MACVLAN_MODE_BRIDGE = 4, /* talk to bridge ports directly */ + MACVLAN_MODE_PASSTHRU = 8, /* take over the underlying device */ }; /* SR-IOV virtual function management section */ diff --git a/ip/iplink_macvlan.c b/ip/iplink_macvlan.c index a3c78bd..15022aa 100644 --- a/ip/iplink_macvlan.c +++ b/ip/iplink_macvlan.c @@ -23,14 +23,14 @@ static void explain(void) { fprintf(stderr, - Usage: ... macvlan mode { private | vepa | bridge }\n + Usage: ... macvlan mode { private | vepa | bridge | passthru }\n ); } static int mode_arg(void) { fprintf(stderr, Error: argument of \mode\ must be \private\, - \vepa\ or \bridge\\n); + \vepa\, \bridge\ or \passthru\ \n); return -1; } @@ -48,6 +48,8 @@ static int macvlan_parse_opt(struct link_util *lu, int argc, char **argv, mode = MACVLAN_MODE_VEPA; else if (strcmp(*argv, bridge) == 0) mode = MACVLAN_MODE_BRIDGE; + else if (strcmp(*argv, passthru) == 0) + mode = MACVLAN_MODE_PASSTHRU; else return mode_arg(); @@ -82,6 +84,7 @@ static void macvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[] mode == MACVLAN_MODE_PRIVATE ? private : mode == MACVLAN_MODE_VEPA? vepa : mode == MACVLAN_MODE_BRIDGE ? bridge + : mode == MACVLAN_MODE_PASSTHRU ? 
passthru :unknown); } diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c index 35199b1..5665b6d 100644 --- a/ip/iplink_macvtap.c +++ b/ip/iplink_macvtap.c @@ -20,14 +20,14 @@ static void explain(void) { fprintf(stderr, - Usage: ... macvtap mode { private | vepa | bridge }\n + Usage: ... macvtap mode { private | vepa | bridge | passthru }\n ); } static int mode_arg(void) { fprintf(stderr, Error: argument of \mode\ must be \private\, - \vepa\ or \bridge\\n); + \vepa\, \bridge\ or \passthru\ \n); return -1; } @@ -45,6 +45,8 @@ static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv, mode = MACVLAN_MODE_VEPA; else if (strcmp(*argv, bridge) == 0) mode = MACVLAN_MODE_BRIDGE; + else if (strcmp(*argv, passthru) == 0) + mode = MACVLAN_MODE_PASSTHRU; else return mode_arg(); @@ -79,6 +81,7 @@ static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[] mode == MACVLAN_MODE_PRIVATE ? private : mode == MACVLAN_MODE_VEPA? vepa : mode == MACVLAN_MODE_BRIDGE ? bridge + : mode == MACVLAN_MODE_PASSTHRU ? passthru :unknown); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH iproute2] Add passthru mode and support 'mode' parameter with macvtap devices
Support a new 'passthru' mode with macvlan and 'mode' parameter with macvtap devices. Signed-off-by: Sridhar Samudrala s...@us.ibm.com diff --git a/include/linux/if_link.h b/include/linux/if_link.h index f5bb2dc..23de79e 100644 --- a/include/linux/if_link.h +++ b/include/linux/if_link.h @@ -230,6 +230,7 @@ enum macvlan_mode { MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */ MACVLAN_MODE_VEPA= 2, /* talk to other ports through ext bridge */ MACVLAN_MODE_BRIDGE = 4, /* talk to bridge ports directly */ + MACVLAN_MODE_PASSTHRU = 8, /* take over the underlying device */ }; /* SR-IOV virtual function management section */ diff --git a/ip/Makefile b/ip/Makefile index 2f223ca..6054e8a 100644 --- a/ip/Makefile +++ b/ip/Makefile @@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \ ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \ ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \ iplink_vlan.o link_veth.o link_gre.o iplink_can.o \ -iplink_macvlan.o +iplink_macvlan.o iplink_macvtap.o RTMONOBJ=rtmon.o diff --git a/ip/iplink_macvlan.c b/ip/iplink_macvlan.c index a3c78bd..97787f9 100644 --- a/ip/iplink_macvlan.c +++ b/ip/iplink_macvlan.c @@ -48,6 +48,8 @@ static int macvlan_parse_opt(struct link_util *lu, int argc, char **argv, mode = MACVLAN_MODE_VEPA; else if (strcmp(*argv, bridge) == 0) mode = MACVLAN_MODE_BRIDGE; + else if (strcmp(*argv, passthru) == 0) + mode = MACVLAN_MODE_PASSTHRU; else return mode_arg(); @@ -82,6 +84,7 @@ static void macvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[] mode == MACVLAN_MODE_PRIVATE ? private : mode == MACVLAN_MODE_VEPA? vepa : mode == MACVLAN_MODE_BRIDGE ? bridge + : mode == MACVLAN_MODE_PASSTHRU ? 
passthru :unknown); } diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c new file mode 100644 index 000..040cc68 --- /dev/null +++ b/ip/iplink_macvtap.c @@ -0,0 +1,93 @@ +/* + * iplink_macvtap.cmacvtap device support + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#include stdio.h +#include stdlib.h +#include string.h +#include sys/socket.h +#include linux/if_link.h + +#include rt_names.h +#include utils.h +#include ip_common.h + +static void explain(void) +{ + fprintf(stderr, + Usage: ... macvtap mode { private | vepa | bridge | passthru }\n + ); +} + +static int mode_arg(void) +{ +fprintf(stderr, Error: argument of \mode\ must be \private\, + \vepa\ or \bridge\ \passthru\\n); +return -1; +} + +static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv, + struct nlmsghdr *n) +{ + while (argc 0) { + if (matches(*argv, mode) == 0) { + __u32 mode = 0; + NEXT_ARG(); + + if (strcmp(*argv, private) == 0) + mode = MACVLAN_MODE_PRIVATE; + else if (strcmp(*argv, vepa) == 0) + mode = MACVLAN_MODE_VEPA; + else if (strcmp(*argv, bridge) == 0) + mode = MACVLAN_MODE_BRIDGE; + else if (strcmp(*argv, passthru) == 0) + mode = MACVLAN_MODE_PASSTHRU; + else + return mode_arg(); + + addattr32(n, 1024, IFLA_MACVLAN_MODE, mode); + } else if (matches(*argv, help) == 0) { + explain(); + return -1; + } else { + fprintf(stderr, macvtap: what is \%s\?\n, *argv); + explain(); + return -1; + } + argc--, argv++; + } + + return 0; +} + +static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]) +{ + __u32 mode; + + if (!tb) + return; + + if (!tb[IFLA_MACVLAN_MODE] || + RTA_PAYLOAD(tb[IFLA_MACVLAN_MODE]) sizeof(__u32)) + return; + + mode = *(__u32 *)RTA_DATA(tb[IFLA_VLAN_ID]); + fprintf(f, mode %s , + mode == MACVLAN_MODE_PRIVATE ? 
private + : mode == MACVLAN_MODE_VEPA? vepa + : mode == MACVLAN_MODE_BRIDGE ? bridge + : mode
[RFC PATCH] macvlan: Introduce a PASSTHRU mode to takeover the underlying device
With the current default macvtap mode, a KVM guest using virtio with a macvtap backend has the following limitations:
- cannot change/add a MAC address on the guest virtio-net
- cannot create a VLAN device on the guest virtio-net
- cannot enable promiscuous mode on the guest virtio-net

This patch introduces a new mode called 'passthru' when creating a macvlan device, which allows takeover of the underlying device and passing it to a guest using virtio with a macvtap backend. Only one macvlan device is allowed in passthru mode; it inherits the MAC address from the underlying device and puts that device in promiscuous mode to receive and forward all packets.

Thanks
Sridhar

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 0ef0eb0..bca3cb7 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -38,6 +38,7 @@ struct macvlan_port {
 	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
 	struct list_head	vlans;
 	struct rcu_head		rcu;
+	bool			passthru;
 };

 #define macvlan_port_get_rcu(dev) \
@@ -169,6 +170,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 		macvlan_broadcast(skb, port, NULL,
 				  MACVLAN_MODE_PRIVATE |
 				  MACVLAN_MODE_VEPA    |
+				  MACVLAN_MODE_PASSTHRU|
 				  MACVLAN_MODE_BRIDGE);
 	else if (src->mode == MACVLAN_MODE_VEPA)
 		/* flood to everyone except source */
@@ -185,7 +187,10 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 		return skb;
 	}

-	vlan = macvlan_hash_lookup(port, eth->h_dest);
+	if (port->passthru)
+		vlan = list_first_entry(&port->vlans, struct macvlan_dev, list);
+	else
+		vlan = macvlan_hash_lookup(port, eth->h_dest);
 	if (vlan == NULL)
 		return skb;
@@ -284,6 +289,11 @@ static int macvlan_open(struct net_device *dev)
 	struct net_device *lowerdev = vlan->lowerdev;
 	int err;

+	if (vlan->port->passthru) {
+		dev_set_promiscuity(lowerdev, 1);
+		goto hash_add;
+	}
+
 	err = -EBUSY;
 	if (macvlan_addr_busy(vlan->port, dev->dev_addr))
 		goto out;
@@ -296,6 +306,8 @@ static int macvlan_open(struct net_device *dev)
 		if (err < 0)
 			goto del_unicast;
 	}
+
+hash_add:
 	macvlan_hash_add(vlan);
 	return 0;
@@ -310,12 +322,18 @@ static int macvlan_stop(struct net_device *dev)
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct net_device *lowerdev = vlan->lowerdev;

+	if (vlan->port->passthru) {
+		dev_set_promiscuity(lowerdev, -1);
+		goto hash_del;
+	}
+
 	dev_mc_unsync(lowerdev, dev);
 	if (dev->flags & IFF_ALLMULTI)
 		dev_set_allmulti(lowerdev, -1);

 	dev_uc_del(lowerdev, dev->dev_addr);

+hash_del:
 	macvlan_hash_del(vlan);
 	return 0;
 }
@@ -549,6 +567,7 @@ static int macvlan_port_create(struct net_device *dev)
 	if (port == NULL)
 		return -ENOMEM;

+	port->passthru = false;
 	port->dev = dev;
 	INIT_LIST_HEAD(&port->vlans);
 	for (i = 0; i < MACVLAN_HASH_SIZE; i++)
@@ -593,6 +612,7 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
 	case MACVLAN_MODE_PRIVATE:
 	case MACVLAN_MODE_VEPA:
 	case MACVLAN_MODE_BRIDGE:
+	case MACVLAN_MODE_PASSTHRU:
 		break;
 	default:
 		return -EINVAL;
@@ -661,6 +681,10 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 	}
 	port = macvlan_port_get(lowerdev);

+	/* Only 1 macvlan device can be created in passthru mode */
+	if (port->passthru)
+		return -EINVAL;
+
 	vlan->lowerdev = lowerdev;
 	vlan->dev      = dev;
 	vlan->port     = port;
@@ -671,6 +695,13 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 	if (data && data[IFLA_MACVLAN_MODE])
 		vlan->mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);

+	if (vlan->mode == MACVLAN_MODE_PASSTHRU) {
+		if (!list_empty(&port->vlans))
+			return -EINVAL;
+		port->passthru = true;
+		memcpy(dev->dev_addr, lowerdev->dev_addr, ETH_ALEN);
+	}
+
 	err = register_netdevice(dev);
 	if (err < 0)
 		goto destroy_port;
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 2fc66dd..8454805 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -232,6 +232,7 @@ enum macvlan_mode {
 	MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
 	MACVLAN_MODE_VEPA =
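To see what the receive-path hunk does, here is a tiny user-space model of the dispatch decision; the class and names are mine, purely for illustration, not kernel code. In passthru mode the single macvlan device gets every frame (the lower device is promiscuous), otherwise the destination MAC is looked up in the hash table:

```python
class Port:
    """Toy analogue of struct macvlan_port."""
    def __init__(self, passthru=False):
        self.passthru = passthru
        self.vlans = []     # port->vlans, in creation order
        self.hash = {}      # vlan_hash: MAC -> device name

    def add(self, mac, dev):
        # Mirrors the newlink check: only one device in passthru mode.
        if self.passthru and self.vlans:
            raise ValueError("only one macvlan allowed in passthru mode")
        self.vlans.append(dev)
        self.hash[mac] = dev

    def handle_frame(self, dst_mac):
        # Mirrors the macvlan_handle_frame() hunk: passthru takes the
        # first (only) entry; otherwise look up the destination MAC.
        if self.passthru:
            return self.vlans[0] if self.vlans else None
        return self.hash.get(dst_mac)

p = Port(passthru=True)
p.add("aa:bb", "macvlan0")
assert p.handle_frame("unknown-mac") == "macvlan0"  # promisc: gets everything

rejected = False
try:
    p.add("cc:dd", "macvlan1")   # second device must be refused
except ValueError:
    rejected = True
assert rejected
```

In the non-passthru modes the hash lookup still applies, so frames for unknown MACs are simply returned to the stack.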
Re: [PATCH v11 13/17] Add mp(mediate passthru) device.
On 9/28/2010 8:18 AM, Arnd Bergmann wrote: On Tuesday 28 September 2010, Michael S. Tsirkin wrote: On Tue, Sep 28, 2010 at 04:39:59PM +0200, Arnd Bergmann wrote: Can you be more specific what the problem is? Do you think it breaks when a guest sends VLAN tagged frames, or when macvtap is connected to a VLAN interface that adds another tag (or only the combination)? I expect the protocol value to be wrong when the guest sends VLAN tagged frames, as 802.1Q frames have a different format. Ok, I see. Would that be fixed by using eth_type_trans()? I don't see any code in there that tries to deal with the VLAN tag, so do we have the same problem in the tun/tap driver? tun_get_user() does call eth_type_trans(). Not sure why I didn't use it in the macvtap code. Need to test it with guest VLAN tagging to make sure it works. Also, I wonder how we handle the case where both the guest and the host do VLAN tagging. Does the host transparently override the guest tag, or does it add a nested tag? More importantly, what should it do? I would think if both guest and host do VLAN tagging, the tags will be nested. Thanks Sridhar -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
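On the nested-tag question: stacking works because each 802.1Q push simply inserts four bytes (TPID + TCI) after the source MAC, so a tag added by the host lands outside the guest's tag. A quick byte-level sketch (the helper name `vlan_push` and the sample MACs are mine, for illustration only):

```python
import struct

ETH_P_8021Q = 0x8100
ETH_P_IP = 0x0800

def vlan_push(frame: bytes, vid: int, tpid: int = ETH_P_8021Q) -> bytes:
    """Insert an 802.1Q tag after the source MAC (offset 12)."""
    tci = vid & 0x0FFF  # PCP/DEI left zero for simplicity
    tag = struct.pack("!HH", tpid, tci)
    return frame[:12] + tag + frame[12:]

# A minimal untagged Ethernet header: dst MAC, src MAC, EtherType.
eth = b"\x02" * 6 + b"\x04" * 6 + struct.pack("!H", ETH_P_IP)

guest_tagged = vlan_push(eth, vid=100)    # tag added by the guest
nested = vlan_push(guest_tagged, vid=200) # outer tag added by the host

# Outer TPID sits at offset 12, the guest's TPID gets pushed to offset 16,
# and the original EtherType ends up at offset 20.
assert struct.unpack_from("!H", nested, 12)[0] == ETH_P_8021Q
assert struct.unpack_from("!H", nested, 14)[0] & 0x0FFF == 200  # host VID
assert struct.unpack_from("!H", nested, 16)[0] == ETH_P_8021Q
assert struct.unpack_from("!H", nested, 18)[0] & 0x0FFF == 100  # guest VID
assert struct.unpack_from("!H", nested, 20)[0] == ETH_P_IP
```

This also shows why the protocol field goes wrong without eth_type_trans(): after the guest's push, the EtherType at the usual offset 12 is 0x8100, not the payload protocol.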
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
On 9/9/2010 2:45 AM, Krishna Kumar2 wrote: Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM: Some more results and the likely cause for the single-netperf degradation below.

Guest -> Host (single netperf): I am getting a drop of almost 20%. I am trying to figure out why.
Host -> Guest (single netperf): I am getting an improvement of almost 15%. Again - unexpected.
Guest -> Host TCP_RR: I get an average 7.4% increase in #packets for runs up to 128 sessions. With fewer netperfs (under 8), there was a drop of 3-7% in #packets, but beyond that, the #packets improved significantly to give an average improvement of 7.4%.

So it seems that fewer sessions is having a negative effect for some reason on the tx side. The code path in virtio-net has not changed much, so the drop in some cases is quite unexpected. The drop for the single netperf seems to be due to multiple vhosts. I changed the patch to start a *single* vhost:

Guest -> Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
Guest -> Host (1 netperf):      Latency: -3%, SD: 3.5%

I remember seeing a similar issue when using a separate vhost thread for the TX and RX queues. Basically, we should have the same vhost thread process a TCP flow in both directions. I guess this allows the data and ACKs to be processed in sync.

Thanks
Sridhar

A single vhost performs well but hits the barrier at 16 netperf sessions:

SINGLE vhost (Guest -> Host):
  1 netperf:   BW: 10.7%   SD: -1.4%
  4 netperfs:  BW: 3%      SD: 1.4%
  8 netperfs:  BW: 17.7%   SD: -10%
  16 netperfs: BW: 4.7%    SD: -7.0%
  32 netperfs: BW: -6.1%   SD: -5.7%

BW and SD both improve (guest multiple txqs help). For 32 netperfs, SD improves. But with multiple vhosts, the guest is able to send more packets and BW increases much more (SD too increases, but I think that is expected).
From the earlier results:

N#   BW1    BW2 (%)          SD1    SD2 (%)          RSD1   RSD2 (%)
_____________________________________________________________________
4    26387  40716 (54.30)    20     28 (40.00)       86     85 (-1.16)
8    24356  41843 (71.79)    88     129 (46.59)      372    362 (-2.68)
16   23587  40546 (71.89)    375    564 (50.40)      1558   1519 (-2.50)
32   22927  39490 (72.24)    1617   2171 (34.26)     6694   5722 (-14.52)
48   23067  39238 (70.10)    3931   5170 (31.51)     15823  13552 (-14.35)
64   22927  38750 (69.01)    7142   9914 (38.81)     28972  26173 (-9.66)
96   22568  38520 (70.68)    16258  27844 (71.26)    65944  73031 (10.74)
_____________________________________________________________________
(All tests were done without any tuning.)

From my testing:
1. A single vhost improves mq guest performance up to 16 netperfs but degrades after that.
2. Multiple vhosts degrade single-netperf guest performance, but significantly improve performance for any higher number of netperf sessions.

Likely cause for the 1-stream degradation with the multiple-vhost patch:
1. Two vhosts run, handling RX and TX respectively. I think the issue is related to cache ping-pong, especially since these run on different cpus/sockets.
2. I (re-)modified the patch to share RX with TX[0]. The performance drop is the same, but the reason is that the guest is not using txq[0] in most cases (dev_pick_tx), so vhost's rx and tx are running on different threads. But whenever the guest uses txq[0], only one vhost runs and the performance is similar to original.

I went back to my *submitted* patch and started a guest with numtxq=16 and pinned every vhost to cpus #01. Now whether the guest used txq[0] or txq[n], the performance is similar or better (between 10-27% across 10 runs) than the original code. Also, -6% to -24% improvement in SD. I will start a full test run of original vs submitted code with minimal tuning (Avi also suggested the same), and re-send. Please let me know if you need any other data.
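As a quick sanity check on the (%) columns above: each one is a plain percent change of the multiqueue value over the single-queue baseline, which a few lines of Python can confirm (the helper name `pct` is mine, not from any test harness):

```python
def pct(before: float, after: float) -> float:
    """Percent change from the single-queue value to the multiqueue value."""
    return (after - before) / before * 100.0

# Spot-check a few cells from the table above.
assert abs(pct(26387, 40716) - 54.30) < 0.01    # N#=4 bandwidth gain
assert abs(pct(24356, 41843) - 71.79) < 0.01    # N#=8 bandwidth gain
assert abs(pct(6694, 5722) - (-14.52)) < 0.01   # N#=32 remote SD improves
```

Negative RSD deltas are improvements here (less remote service demand per unit of work), which is why the N#=32 and N#=48 rows read as wins despite the minus sign.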
Thanks, - KK
Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
On Fri, 2010-09-03 at 13:14 +0300, Michael S. Tsirkin wrote: On Tue, Aug 10, 2010 at 06:23:24PM -0700, Shirley Ma wrote: Hello Xiaohui, On Fri, 2010-08-06 at 17:23 +0800, xiaohui@intel.com wrote: Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later. Have you had any performance data to share here? I tested my experimental macvtap zero copy for TX only. The performance I have seen is as below, without any tuning (default settings):

Before: netperf 16K message size results with a 60 sec run is 7.5Gb/s over an ixgbe 10GbE card. perf top shows:

 2103.00 12.9% copy_user_generic_string
 1541.00  9.4% handle_tx
 1490.00  9.1% _raw_spin_unlock_irqrestore
 1361.00  8.3% _raw_spin_lock_irqsave
 1288.00  7.9% _raw_spin_lock
  924.00  5.7% vhost_worker

After: netperf results with a 60 sec run is 8.1Gb/s, perf output:

 1093.00  9.9% _raw_spin_unlock_irqrestore
 1048.00  9.5% handle_tx
  934.00  8.5% _raw_spin_lock_irqsave
  864.00  7.9% _raw_spin_lock
  644.00  5.9% vhost_worker
  387.00  3.5% use_mm

I am still working on collecting more data (latency, cpu utilization...). I will let you know once I get all the data for macvtap TX zero copy. Also I found some vhost performance regression on the new kernel with tuning. I used to get 9.4Gb/s; now I can't get it. Shirley

Could you please try disabling mergeable buffers, and see if this gets you back where you were? -global virtio-net-pci.mrg_rxbuf=off

I don't think Shirley had mergeable buffers on when she ran these tests. The qemu patch to support mergeable buffers with vhost is not yet upstream. Shirley is on vacation and will be back on Sept 7, and can provide more detailed performance data and post her patch.

Thanks
Sridhar
Re: [PATCH] cgroups: fix API thinko
On 8/5/2010 3:59 PM, Michael S. Tsirkin wrote: The cgroup_attach_task_current_cg API that we have upstream is backwards: we really need an API to attach the current task to the cgroups of another process A. In our case (vhost), a privileged user wants to attach its task to the cgroups of a less privileged one; the API makes us run it in the other task's context, and this fails. So let's make the API generic and just pass in 'from' and 'to' tasks. Add an inline wrapper for cgroup_attach_task_current_cg to avoid breaking bisect. Signed-off-by: Michael S. Tsirkin <m...@redhat.com> --- Paul, Li, Sridhar, could you please review the following patch? I only compile-tested it due to travel, but it looks straightforward to me. Alex Williamson volunteered to test and report the results. Sending out now for review as I might be offline for a bit. Will only try to merge when done, obviously. If OK, I would like to merge this through the -net tree, together with the patch fixing vhost-net. Let me know if that sounds ok. Thanks! This patch is on top of net-next; it is needed to fix a vhost-net regression in net-next, where a non-privileged process can't enable the device anymore: when qemu uses vhost, inside the ioctl call it creates a thread, and tries to add this thread to the groups of current, and it fails. But we control the thread, so to solve the problem, we really should tell it 'connect to our cgroups'.

So an unprivileged qemu cannot attach the vhost thread to its own cgroups. I guess you are planning to make the cgroup_attach_task_all() call in vhost_worker() to attach itself to the cgroups of qemu. The new API looks fine, but the name is a little confusing. How about task_inherit_cgroups(struct task_struct *from, struct task_struct *to)?

What this patch does is add an API for that.
 include/linux/cgroup.h |   11 ++++++++---
 kernel/cgroup.c        |    9 +++++----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 43b2072..b38ec60 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -525,7 +525,11 @@ struct task_struct *cgroup_iter_next(struct cgroup *cgrp,
 void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
-int cgroup_attach_task_current_cg(struct task_struct *);
+int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
+static inline int cgroup_attach_task_current_cg(struct task_struct *tsk)
+{
+	return cgroup_attach_task_all(current, tsk);
+}

 /*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
@@ -583,6 +587,11 @@ static inline int cgroupstats_build(struct cgroupstats *stats,
 }

 /* No cgroups - nothing to do */
+static inline int cgroup_attach_task_all(struct task_struct *from,
+					 struct task_struct *t)
+{
+	return 0;
+}
 static inline int cgroup_attach_task_current_cg(struct task_struct *t)
 {
 	return 0;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index dce8ebc..e6293b8 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1606,10 +1606,11 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 }

 /**
- * cgroup_attach_task_current_cg - attach task 'tsk' to current task's cgroup
+ * cgroup_attach_task_all - attach task 'tsk' to all cgroups of task 'from'
+ * @from: attach to all cgroups of a given task
  * @tsk: the task to be attached
  */
-int cgroup_attach_task_current_cg(struct task_struct *tsk)
+int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 {
 	struct cgroupfs_root *root;
 	struct cgroup *cur_cg;
@@ -1617,7 +1618,7 @@ int cgroup_attach_task_current_cg(struct task_struct *tsk)
 	cgroup_lock();
 	for_each_active_root(root) {
-		cur_cg = task_cgroup_from_root(current, root);
+		cur_cg = task_cgroup_from_root(from, root);

Now that we are not operating on current, cur_cg should be renamed to from_cg.

 		retval = cgroup_attach_task(cur_cg, tsk);
 		if (retval)
 			break;
@@ -1626,7 +1627,7 @@ int cgroup_attach_task_current_cg(struct task_struct *tsk)
 	return retval;
 }
-EXPORT_SYMBOL_GPL(cgroup_attach_task_current_cg);
+EXPORT_SYMBOL_GPL(cgroup_attach_task_all);

 /*
  * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
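The shape of the change is worth noting: generalize the function to take an explicit 'from' task, and keep the old name as a thin inline wrapper so existing callers (and bisection) keep working. A user-space sketch of that pattern, with dict-based toy tasks and names of my own invention:

```python
def attach_task_all(from_task, tsk):
    """Generalized API: attach tsk to every cgroup hierarchy of from_task."""
    for root, membership in from_task["cgroups"].items():
        tsk["cgroups"][root] = membership   # one attach per active root
    return 0

def attach_task_current_cg(current, tsk):
    # Old name kept as a thin wrapper over the generalized call,
    # mirroring the inline wrapper in the patch.
    return attach_task_all(current, tsk)

# A "qemu" task with cgroup memberships, and a fresh "vhost worker" task.
qemu = {"cgroups": {"cpu": "/vm1", "memory": "/vm1"}}
vhost = {"cgroups": {}}

assert attach_task_all(qemu, vhost) == 0
assert vhost["cgroups"] == qemu["cgroups"]   # worker now charged to the VM
```

This is exactly the direction the vhost fix needs: the worker thread calls the equivalent of attach_task_all(owner, self), so the unprivileged qemu never has to attach a foreign task itself.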
Re: Make vhost multi-threaded and associate each thread to its guest's cgroup
On Tue, 2010-07-27 at 23:42 +0300, Michael S. Tsirkin wrote: Sridhar, I pushed a patchset with all known issues fixed, on my vhost-net-next branch. For now this ignores the cpu mask issue, addressing only the cgroups issue. Would appreciate testing and reports.

I had to apply the following patch to get it to build. With this patch, I am seeing similar results as I saw earlier.

Thanks
Sridhar

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 996e751..8543898 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -23,6 +23,7 @@
 #include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
+#include <linux/cgroup.h>
 #include <linux/net.h>
 #include <linux/if_packet.h>
@@ -252,7 +253,7 @@ static long vhost_dev_set_owner(struct vhost_dev *dev)
 	}
 	dev->worker = worker;
-	err = cgroup_attach_task_current_cg(poller);
+	err = cgroup_attach_task_current_cg(worker);
 	if (err)
 		goto err_cgroup;
 	wake_up_process(worker);	/* avoid contributing to loadavg */
Re: [PATCH repost] sched: export sched_set/getaffinity to modules
On Mon, 2010-07-26 at 20:12 +0300, Michael S. Tsirkin wrote: On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote: On 07/02, Peter Zijlstra wrote: On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote: Does it (Tejun's kthread_clone() patch) also inherit the cgroup of the caller? Of course, it's a simple do_fork() which inherits everything just as you would expect from a similar sys_clone()/sys_fork() call. Yes. And I'm afraid it can inherit more than we want. IIUC, this is called from ioctl(), right? Then the new thread becomes the natural child of the caller, and it shares ->mm with the parent. And files, dup_fd() without CLONE_FS. Signals. Say, if you send SIGKILL to this new thread, it can't sleep in TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent just because the parent gets SIGQUIT or another coredumpable signal. Or the new thread can receive SIGSTOP via ^Z. Perhaps this is OK, I do not know. Just to remind that kernel_thread() is merely clone(CLONE_VM). Oleg. With some machinery to stop it later, yes. Oleg, how does the below look to you? Here I explicitly drop the fds so we don't share them. CLONE_VM takes care of sharing the mm, I think. About signals - for the vhost-net use this is OK as we use uninterruptible sleep anyway (like the new kthread_worker does). This code seems to work fine for me so far - any comments?

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..72c7b17 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -9,6 +9,11 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 				   const char namefmt[], ...)
 	__attribute__((format(printf, 3, 4)));

+struct task_struct *kthread_create_inherit(int (*threadfn)(void *data),
+					   void *data,
+					   const char namefmt[], ...)
+	__attribute__((format(printf, 3, 4)));
+
 /**
  * kthread_run - create and wake a thread.
  * @threadfn: the function to run until signal_pending(current).
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 83911c7..b81588c 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -149,6 +149,38 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 }
 EXPORT_SYMBOL(kthread_create);

+/* Same as kthread_create, but inherit attributes (cgroups, priority, CPU mask)
+ * from current. */
+struct task_struct *kthread_create_inherit(int (*threadfn)(void *data),
+					   void *data,
+					   const char namefmt[],
+					   ...)
+{
+	struct kthread_create_info create;
+
+	create.threadfn = threadfn;
+	create.data = data;
+	init_completion(&create.done);
+
+	create_kthread(&create);
+	wait_for_completion(&create.done);
+
+	if (!IS_ERR(create.result)) {
+		va_list args;
+
+		/* Don't share files with parent as drivers use release for
+		 * close on exit, etc. */
+		exit_files(create.result);
+
+		va_start(args, namefmt);
+		vsnprintf(create.result->comm, sizeof(create.result->comm),
+			  namefmt, args);
+		va_end(args);
+	}
+	return create.result;
+}
+EXPORT_SYMBOL(kthread_create_inherit);
+
 /**
  * kthread_bind - bind a just-created kthread to a cpu.
  * @p: thread created by kthread_create().

I have been testing out a similar patch that uses kernel_thread() without the CLONE_FILES flag rather than create_kthread() and then closing the files. Either version should be fine.

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..634eaf7 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -9,6 +9,11 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 				   const char namefmt[], ...)
 	__attribute__((format(printf, 3, 4)));

+struct task_struct *kthread_clone(int (*threadfn)(void *data),
+				  void *data,
+				  const char namefmt[], ...)
+	__attribute__((format(printf, 3, 4)));
+
 /**
  * kthread_run - create and wake a thread.
  * @threadfn: the function to run until signal_pending(current).
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 83911c7..806dae5 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -149,6 +149,38 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 }
 EXPORT_SYMBOL(kthread_create);

+struct task_struct *kthread_clone(int (*threadfn)(void *data),
+				  void *data,
+				  const char namefmt[],
+				  ...)
+{
+	struct kthread_create_info create;
+	int
Re: Virtio network performance poorer than emulated e1000
On Thu, 2010-07-22 at 19:53 -0400, Balachandar wrote: I am resending this email as Freddie didn't use 'reply to all' when replying to this message. I am also updating to answer Freddie's questions. I can see that virtio network performance is poorer than the emulated e1000 NIC. I did some simple ping tests, and with the emulated e1000 the average RTT is around 600 microsec. With virtio the average RTT is 800 microsec. I am using a tap + bridge configuration. I run kvm as follows:

kvm -m 512 -hda vdisk.img \
    -net nic,model=virtio \
    -net tap,ifname=tap0,script=qemu-ifup,downscript=no

With newer qemu-kvm, you need to use -netdev tap to enable checksum/gso offloads. Try:

-net nic,model=virtio,netdev=tap0 \
-netdev tap,ifname=tap0,id=tap0,vhost=on,script=qemu-ifup,downscript=no

Thanks
Sridhar

I am running the Debian squeeze distribution with guest and host kernel 2.6.34. Does anyone else see results like this, or is it only me? Could changing the distribution help, as I am running a testing one? Actually we have a custom application which just measures ping-pong latency. The result I get with virtio is around 750 microseconds. The result I get with the emulated e1000 is around 250 microsec. Also I tried to use vhost but the virtio latency remained the same. Also I tried the tests with guest and host 2.6.26 kernel. I get better results for virtio than e1000: I get 550 for e1000 and 500 for virtio. Actually my application needs as low a latency as possible. I am ready to trade off throughput and cpu utilization. I was very excited when I saw the vhost-net module, with people claiming low latencies. I am worried that I am missing some performance offered by virtio and vhost.
Thanks, Bala
Re: [PATCH] vhost-net: avoid flush under lock
On Thu, 2010-07-15 at 15:19 +0300, Michael S. Tsirkin wrote: We flush under the vq mutex when changing backends. This creates a deadlock, as the workqueue being flushed needs this lock as well. https://bugzilla.redhat.com/show_bug.cgi?id=612421 Drop the vq mutex before the flush: we hold the device mutex, which is sufficient to prevent another ioctl from touching the vq.

Why do we need to flush the vq when trying to set the backend and we find that it is already set? Is this just an optimization?

Thanks
Sridhar

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
 drivers/vhost/net.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 28d7786..50df58e6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -534,11 +534,16 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	rcu_assign_pointer(vq->private_data, sock);
 	vhost_net_enable_vq(n, vq);
 done:
+	mutex_unlock(&vq->mutex);
+
 	if (oldsock) {
 		vhost_net_flush_vq(n, index);
 		fput(oldsock->file);
 	}
+
+	mutex_unlock(&n->dev.mutex);
+	return 0;
+
 err_vq:
 	mutex_unlock(&vq->mutex);
 err:
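The deadlock and the fix are easy to reproduce in user space: if the path doing the "flush" (a join here) still holds the lock that the flushed work needs, neither side can make progress. A minimal Python model of the corrected ordering (names are mine; this is an analogy, not vhost code):

```python
import threading

lock = threading.Lock()       # plays the role of vq->mutex
done = threading.Event()

def worker():
    # The "flush" target: the work item needs the same lock the
    # ioctl path holds, like handle_tx() taking the vq mutex.
    with lock:
        done.set()

t = threading.Thread(target=worker)
with lock:
    # Backend swap happens under the lock. Joining t *here* would
    # deadlock: t blocks on the lock, we block forever on the join.
    t.start()
    swapped = True

# The fix from the patch: drop the mutex *before* flushing.
t.join()            # the flush now completes
assert done.is_set()
```

The device-level mutex in the real code is what makes it safe to drop the vq mutex early: no other ioctl can touch the vq in the window between the unlock and the flush.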
Re: [PATCH repost] sched: export sched_set/getaffinity to modules
On Tue, 2010-07-13 at 14:09 +0300, Michael S. Tsirkin wrote: On Mon, Jul 12, 2010 at 11:59:08PM -0700, Sridhar Samudrala wrote: On 7/4/2010 2:00 AM, Michael S. Tsirkin wrote: On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote: On 07/02, Peter Zijlstra wrote: On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote: Does it (Tejun's kthread_clone() patch) also inherit the cgroup of the caller? Of course, it's a simple do_fork() which inherits everything just as you would expect from a similar sys_clone()/sys_fork() call. Yes. And I'm afraid it can inherit more than we want. IIUC, this is called from ioctl(), right? Then the new thread becomes the natural child of the caller, and it shares ->mm with the parent. And files, dup_fd() without CLONE_FS. Signals. Say, if you send SIGKILL to this new thread, it can't sleep in TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent just because the parent gets SIGQUIT or another coredumpable signal. Or the new thread can receive SIGSTOP via ^Z. Perhaps this is OK, I do not know. Just to remind that kernel_thread() is merely clone(CLONE_VM). Oleg. Right. Doing this might break things like flush. The signal and exit behaviour needs to be examined carefully. I am also unsure whether using such threads might be more expensive than inheriting kthreadd. Should we just leave it to userspace to set the cgroup/cpumask after qemu starts the guest and the vhost threads? Thanks Sridhar Yes, but we can't trust userspace to do this. It's important to do it on thread creation: if we don't, malicious userspace can create a large amount of work exceeding the cgroup limits. And the same applies to the affinity: if the qemu process is limited to a set of CPUs, it's important to make the kernel thread that does work on our behalf limited to the same set of CPUs.
This is not unique to vhost, it's just that virt scenarios are affected by this more: people seem to run untrusted applications and expect the damage to be contained. OK. So we want to create a thread that is a child of kthreadd, but inherits the cgroup/cpumask from the caller. How about an exported kthread function kthread_create_in_current_cg() that does this?

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..e0616f0 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -9,6 +9,9 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 				   const char namefmt[], ...)
 	__attribute__((format(printf, 3, 4)));

+struct task_struct *kthread_create_in_current_cg(int (*threadfn)(void *data),
+						 void *data, char *name);
+
 /**
  * kthread_run - create and wake a thread.
  * @threadfn: the function to run until signal_pending(current).
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 83911c7..ea4e737 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <trace/events/sched.h>
+#include <linux/cgroup.h>

 static DEFINE_SPINLOCK(kthread_create_lock);
 static LIST_HEAD(kthread_create_list);
@@ -149,6 +150,42 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 }
 EXPORT_SYMBOL(kthread_create);

+struct task_struct *kthread_create_in_current_cg(int (*threadfn)(void *data),
+						 void *data, char *name)
+{
+	struct task_struct *worker;
+	cpumask_var_t mask;
+	int ret = -ENOMEM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		goto out_free_mask;
+
+	worker = kthread_create(threadfn, data, "%s-%d", name, current->pid);
+	if (IS_ERR(worker))
+		goto out_free_mask;
+
+	ret = sched_getaffinity(current->pid, mask);
+	if (ret)
+		goto out_stop_worker;
+
+	ret = sched_setaffinity(worker->pid, mask);
+	if (ret)
+		goto out_stop_worker;
+
+	ret = cgroup_attach_task_current_cg(worker);
+	if (ret)
+		goto out_stop_worker;
+
+	return worker;
+
+out_stop_worker:
+	kthread_stop(worker);
+out_free_mask:
+	free_cpumask_var(mask);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(kthread_create_in_current_cg);
+
 /**
  * kthread_bind - bind a just-created kthread to a cpu.
  * @p: thread created by kthread_create().

Thanks
Sridhar
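The getaffinity/setaffinity pair in the helper has a direct user-space analogue on Linux, which shows the intended semantics: read the parent's CPU mask and apply the identical mask to the worker. A sketch (the function name `inherit_affinity` is mine; pid 0 means "the calling task", and this assumes a Linux host since these calls are Linux-specific):

```python
import os

def inherit_affinity(worker_pid: int, parent_pid: int = 0) -> set:
    """Copy the parent's CPU mask onto the worker, mirroring the
    sched_getaffinity()/sched_setaffinity() pair in the patch."""
    mask = os.sched_getaffinity(parent_pid)   # 0 == current task
    os.sched_setaffinity(worker_pid, mask)
    return mask

# Applying it to ourselves is a no-op copy, but exercises both calls.
mask = inherit_affinity(0)
assert mask == os.sched_getaffinity(0)
assert len(mask) >= 1
```

The cgroup step has no such simple user-space one-liner, which is exactly why the thread itself (or the kernel, at creation time) has to do the attach rather than trusting the unprivileged parent to do it afterwards.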
Re: [PATCH repost] sched: export sched_set/getaffinity to modules
On 7/14/2010 5:05 PM, Oleg Nesterov wrote: On 07/14, Sridhar Samudrala wrote: OK. So we want to create a thread that is a child of kthreadd, but inherits the cgroup/cpumask from the caller. How about an exported kthread function kthread_create_in_current_cg() that does this? Well. I must admit, this looks a bit strange to me ;) Instead of exporting sched_xxxaffinity() we export the new function which calls them. And I don't think this new helper is very useful in general. Maybe I am wrong... If we agree on exporting the sched_xxxaffinity() functions, we don't need this new kthread function and we can do the same in vhost as the original patch did.

Thanks
Sridhar
Re: [PATCH repost] sched: export sched_set/getaffinity to modules
On 7/4/2010 2:00 AM, Michael S. Tsirkin wrote: On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote: On 07/02, Peter Zijlstra wrote: On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote: Does it (Tejun's kthread_clone() patch) also inherit the cgroup of the caller? Of course, it's a simple do_fork() which inherits everything just as you would expect from a similar sys_clone()/sys_fork() call. Yes. And I'm afraid it can inherit more than we want. IIUC, this is called from ioctl(), right? Then the new thread becomes the natural child of the caller, and it shares ->mm with the parent. And files, dup_fd() without CLONE_FS. Signals. Say, if you send SIGKILL to this new thread, it can't sleep in TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent just because the parent gets SIGQUIT or another coredumpable signal. Or the new thread can receive SIGSTOP via ^Z. Perhaps this is OK, I do not know. Just to remind that kernel_thread() is merely clone(CLONE_VM). Oleg. Right. Doing this might break things like flush. The signal and exit behaviour needs to be examined carefully. I am also unsure whether using such threads might be more expensive than inheriting kthreadd. Should we just leave it to userspace to set the cgroup/cpumask after qemu starts the guest and the vhost threads?

Thanks
Sridhar
Re: [PATCH repost] sched: export sched_set/getaffinity to modules
On 7/1/2010 7:55 AM, Peter Zijlstra wrote: On Thu, 2010-07-01 at 16:53 +0200, Tejun Heo wrote: Hello, On 07/01/2010 04:46 PM, Oleg Nesterov wrote: It might be a good idea to make the function take extra clone flags, but anyway, once created, the cloned task can be treated the same way as other kthreads, so nothing else needs to be changed. This makes kthread_stop() work. Otherwise the new thread is just the CLONE_VM child of the caller, and the caller is the user-mode task doing ioctl()? Hmmm, indeed. It makes the attribute inheritance work but circumvents the whole reason there is kthreadd. I thought the whole reason there was kthreadd was to avoid the inheritance? So avoiding the avoiding of inheritance seems like the goal here, no? I think so. Does it (Tejun's kthread_clone() patch) also inherit the cgroup of the caller? Or do we still need the explicit call to attach the thread to the current task's cgroup? I am on vacation next week and cannot look into this until Jul 12. Hope this will be resolved by then. If not, I will look into it after I am back.

Thanks
Sridhar
Re: [PATCH] vhost: break out of polling loop on error
On Sun, 2010-06-27 at 11:59 +0300, Michael S. Tsirkin wrote:

When ring parsing fails, we currently handle this as a ring-empty
condition. This means that we enable kicks and recheck ring empty: if it
is not empty, we re-start polling, which of course will fail again.
Instead, let's return a negative error code and stop polling.

One minor comment on the error return below. With that change,
Acked-by: Sridhar Samudrala s...@us.ibm.com

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
Dave, I'm sending this out so it can get reviewed. I'll put this on my
vhost tree, so no need for you to pick this patch directly.

 drivers/vhost/net.c   | 12 ++--
 drivers/vhost/vhost.c | 33 +
 drivers/vhost/vhost.h |  8
 3 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 0f41c91..54096ee 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -98,7 +98,8 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 static void handle_tx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
-	unsigned head, out, in, s;
+	unsigned out, in, s;
+	int head;
 	struct msghdr msg = {
 		.msg_name = NULL,
 		.msg_namelen = 0,
@@ -135,6 +136,9 @@ static void handle_tx(struct vhost_net *net)
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
 					 NULL, NULL);
+		/* On error, stop handling until the next kick. */
+		if (head < 0)
+			break;
 		/* Nothing new? Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
 			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
@@ -192,7 +196,8 @@ static void handle_tx(struct vhost_net *net)
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-	unsigned head, out, in, log, s;
+	unsigned out, in, log, s;
+	int head;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -228,6 +233,9 @@ static void handle_rx(struct vhost_net *net)
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
 					 vq_log, &log);
+		/* On error, stop handling until the next kick. */
+		if (head < 0)
+			break;
 		/* OK, now we need to know about added descriptors. */
 		if (head == vq->num) {
 			if (unlikely(vhost_enable_notify(vq))) {
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 3b83382..5ccd384 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -873,12 +873,13 @@ static unsigned get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
  * number of output then some number of input descriptors, it's actually two
  * iovecs, but we pack them into one and note how many of each there were.
  *
- * This function returns the descriptor number found, or vq->num (which
- * is never a valid descriptor number) if none was found. */
-unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
-			   struct iovec iov[], unsigned int iov_size,
-			   unsigned int *out_num, unsigned int *in_num,
-			   struct vhost_log *log, unsigned int *log_num)
+ * This function returns the descriptor number found, or vq->num (which is
+ * never a valid descriptor number) if none was found. A negative code is
+ * returned on error. */
+int vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+		      struct iovec iov[], unsigned int iov_size,
+		      unsigned int *out_num, unsigned int *in_num,
+		      struct vhost_log *log, unsigned int *log_num)
 {
 	struct vring_desc desc;
 	unsigned int i, head, found = 0;
@@ -890,13 +891,13 @@ unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 	if (get_user(vq->avail_idx, &vq->avail->idx)) {
 		vq_err(vq, "Failed to access avail idx at %p\n",
 		       &vq->avail->idx);
-		return vq->num;
+		return -EFAULT;
 	}

 	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
 		vq_err(vq, "Guest moved used index from %u to %u",
 		       last_avail_idx, vq->avail_idx);
-		return vq->num;
+		return -EFAULT;

This should be -EINVAL.

 	}

 	/* If there's nothing new since last we looked, return invalid. */
@@ -912,14 +913,14 @@ unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 		vq_err(vq, "Failed to read head: idx %d address %p\n
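The review point above (-EINVAL vs. -EFAULT) rests on the convention this patch introduces: a non-negative value is a descriptor index, vq->num means "ring empty", and any negative value is a -errno that should stop polling. A hypothetical userspace sketch of the three-way caller pattern (names and the toy index logic are mine, not the vhost code):

```c
#include <errno.h>

#define VQ_NUM 256  /* ring size; the "empty" sentinel, never a valid index */

/* Hypothetical stand-in for vhost_get_vq_desc(): returns a descriptor
 * index in [0, VQ_NUM), VQ_NUM when the ring is empty, or a negative
 * errno on a malformed ring (the case this patch adds). */
static int get_desc(int avail_idx, int last_avail_idx)
{
    if (avail_idx < 0)
        return -EFAULT;                 /* could not read the avail index */
    if ((unsigned)(avail_idx - last_avail_idx) > VQ_NUM)
        return -EINVAL;                 /* guest moved the index: bad state */
    if (avail_idx == last_avail_idx)
        return VQ_NUM;                  /* nothing new: ring empty */
    return last_avail_idx % VQ_NUM;     /* a valid descriptor index */
}

/* Caller-side pattern from handle_tx()/handle_rx(): treat errors as
 * "stop polling", not as "empty", so we never spin on a broken ring. */
static int handle_once(int avail_idx, int last_avail_idx)
{
    int head = get_desc(avail_idx, last_avail_idx);

    if (head < 0)
        return head;    /* break out; wait for the next kick */
    if (head == VQ_NUM)
        return 0;       /* empty: enable notification and recheck */
    return 1;           /* process descriptor 'head' */
}
```

The key property is that the error range, the empty sentinel, and valid indices never overlap, so one `int` return carries all three outcomes.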
Re: [PATCHv2] vhost-net: add dhclient work-around from userspace
On Mon, 2010-06-28 at 13:08 +0300, Michael S. Tsirkin wrote:

Userspace virtio server has the following hack so guests rely on it, and
we have to replicate it, too: use the port number to detect incoming
IPv4 DHCP response packets, and fill in the checksum for these.

The issue we are solving is that on linux guests, some apps that use
recvmsg with AF_PACKET sockets don't know how to handle
CHECKSUM_PARTIAL; the interface to return the relevant information was
added in 8dc4194474159660d7f37c495e3fc3f10d0db8cc, and older userspace
does not use it. One important user of recvmsg with AF_PACKET is
dhclient, so we add a work-around just for DHCP.

Don't bother applying the hack to IPv6, as userspace virtio does not
have a work-around for that - let's hope guests will do the right thing
wrt IPv6.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
Dave, I'm going to put this patch on the vhost tree, no need for you to
bother merging it - you'll get it with a pull request.

 drivers/vhost/net.c | 44 +++-
 1 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index cc19595..03bba6a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,10 @@
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>

+#include <linux/ip.h>
+#include <linux/udp.h>
+#include <linux/netdevice.h>
+
 #include <net/sock.h>

 #include "vhost.h"
@@ -186,6 +190,44 @@ static void handle_tx(struct vhost_net *net)
 	unuse_mm(net->dev.mm);
 }

+static int peek_head(struct sock *sk)

This routine is doing more than just peeking at the head of sk's receive
queue. Maybe it should be named similar to what qemu calls
'work_around_broken_dhclient()'.

+{
+	struct sk_buff *skb;
+
+	lock_sock(sk);
+	skb = skb_peek(&sk->sk_receive_queue);
+	if (unlikely(!skb)) {
+		release_sock(sk);
+		return 0;
+	}
+	/* Userspace virtio server has the following hack so
+	 * guests rely on it, and we have to replicate it, too: */
+	/* Use port number to detect incoming IPv4 DHCP response packets,
+	 * and fill in the checksum. */
+
+	/* The issue we are solving is that on linux guests, some apps
+	 * that use recvmsg with AF_PACKET sockets, don't know how to
+	 * handle CHECKSUM_PARTIAL;
+	 * The interface to return the relevant information was added in
+	 * 8dc4194474159660d7f37c495e3fc3f10d0db8cc,
+	 * and older userspace does not use it.
+	 * One important user of recvmsg with AF_PACKET is dhclient,
+	 * so we add a work-around just for DHCP. */
+	if (skb->ip_summed == CHECKSUM_PARTIAL &&
+	    skb_headlen(skb) >= skb_transport_offset(skb) +
+				sizeof(struct udphdr) &&
+	    udp_hdr(skb)->dest == htons(68) &&
+	    skb_network_header_len(skb) >= sizeof(struct iphdr) &&
+	    ip_hdr(skb)->protocol == IPPROTO_UDP &&
+	    skb->protocol == htons(ETH_P_IP)) {

Isn't it more logical to check skb->protocol first, followed by ip_hdr
and then udp_hdr?

+		skb_checksum_help(skb);
+		/* Restore ip_summed value: tun passes it to user. */
+		skb->ip_summed = CHECKSUM_PARTIAL;
+	}
+	release_sock(sk);
+	return 1;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
@@ -222,7 +264,7 @@ static void handle_rx(struct vhost_net *net)
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;

-	for (;;) {
+	while (peek_head(sock->sk)) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the
body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
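The ordering point raised in the review above - check the outermost header before dereferencing inner ones - can be sketched in plain userspace C. This parses a raw frame buffer outermost-in (ethertype, then IP protocol, then UDP destination port); it is illustrative only, not the vhost code, and it assumes an untagged Ethernet frame:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>

/* Validate headers outermost-first, so inner-header accessors are only
 * used once the outer checks prove they are meaningful. */
static int is_ipv4_dhcp_reply(const uint8_t *frame, size_t len)
{
    uint16_t ethertype, dport;
    uint8_t proto, ihl;

    if (len < 14 + 20 + 8)              /* eth + minimal ip + udp */
        return 0;
    memcpy(&ethertype, frame + 12, 2);
    if (ntohs(ethertype) != 0x0800)     /* ETH_P_IP: check first */
        return 0;
    ihl = (frame[14] & 0x0f) * 4;       /* IPv4 header length */
    if (ihl < 20 || len < 14 + (size_t)ihl + 8)
        return 0;
    proto = frame[14 + 9];
    if (proto != 17)                    /* IPPROTO_UDP: check second */
        return 0;
    memcpy(&dport, frame + 14 + ihl + 2, 2);
    return ntohs(dport) == 68;          /* DHCP client port: check last */
}

/* Test helper: build a minimal 42-byte frame (no IP options). */
static size_t make_frame(uint8_t *buf, uint16_t ethertype,
                         uint8_t proto, uint16_t dport)
{
    uint16_t et = htons(ethertype), dp = htons(dport);

    memset(buf, 0, 42);
    memcpy(buf + 12, &et, 2);           /* ethertype */
    buf[14] = 0x45;                     /* version 4, IHL 5 */
    buf[14 + 9] = proto;                /* IP protocol */
    memcpy(buf + 14 + 20 + 2, &dp, 2);  /* UDP destination port */
    return 42;
}
```

With this ordering, a non-IP frame never reaches the IP or UDP accessors at all, which is exactly the property the reviewer asks the kernel condition to have.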
Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost pollers
On Thu, 2010-06-24 at 11:11 +0300, Michael S. Tsirkin wrote:

On Sun, May 30, 2010 at 10:25:01PM +0200, Tejun Heo wrote:

Apply the cpumask and cgroup of the initializing task to the created
vhost poller. Based on Sridhar Samudrala's patch.

Cc: Michael S. Tsirkin m...@redhat.com
Cc: Sridhar Samudrala samudrala.srid...@gmail.com

I wanted to apply this, but modpost fails:
ERROR: "sched_setaffinity" [drivers/vhost/vhost_net.ko] undefined!
ERROR: "sched_getaffinity" [drivers/vhost/vhost_net.ko] undefined!
Did you try building as a module?

In my original implementation, I had these calls in workqueue.c. Now
that they are moved to vhost.c, which can be built as a module, these
symbols need to be exported. The following patch fixes the build issue
with vhost as a module.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/kernel/sched.c b/kernel/sched.c
index 3c2a54f..15a0c6f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4837,6 +4837,7 @@ out_put_task:
 	put_online_cpus();
 	return retval;
 }
+EXPORT_SYMBOL_GPL(sched_setaffinity);

 static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
 			     struct cpumask *new_mask)
@@ -4900,6 +4901,7 @@ out_unlock:
 	return retval;
 }
+EXPORT_SYMBOL_GPL(sched_getaffinity);

 /**
  * sys_sched_getaffinity - get the cpu affinity of a process

---
 drivers/vhost/vhost.c | 36 +++-
 1 file changed, 31 insertions(+), 5 deletions(-)

Index: work/drivers/vhost/vhost.c
===
--- work.orig/drivers/vhost/vhost.c
+++ work/drivers/vhost/vhost.c
@@ -23,6 +23,7 @@
 #include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
+#include <linux/cgroup.h>

 #include <linux/net.h>
 #include <linux/if_packet.h>
@@ -176,12 +177,30 @@ repeat:
 long vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
-	struct task_struct *poller;
-	int i;
+	struct task_struct *poller = NULL;
+	cpumask_var_t mask;
+	int i, ret = -ENOMEM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		goto out;

 	poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
-	if (IS_ERR(poller))
-		return PTR_ERR(poller);
+	if (IS_ERR(poller)) {
+		ret = PTR_ERR(poller);
+		goto out;
+	}
+
+	ret = sched_getaffinity(current->pid, mask);
+	if (ret)
+		goto out;
+
+	ret = sched_setaffinity(poller->pid, mask);
+	if (ret)
+		goto out;
+
+	ret = cgroup_attach_task_current_cg(poller);
+	if (ret)
+		goto out;

 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
@@ -202,7 +221,14 @@ long vhost_dev_init(struct vhost_dev *de
 		vhost_poll_init(&dev->vqs[i].poll,
 				dev->vqs[i].handle_kick, POLLIN, dev);
 	}
-	return 0;
+
+	wake_up_process(poller);	/* avoid contributing to loadavg */
+	ret = 0;
+out:
+	if (ret)
+		kthread_stop(poller);
+	free_cpumask_var(mask);
+	return ret;
 }

 /* Caller should have device mutex */
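The affinity-copy step in vhost_dev_init() above has a direct userspace analogue: sched_getaffinity()/sched_setaffinity() are the same operations, reached through the libc syscall wrappers rather than the in-kernel functions the patch exports. A minimal runnable sketch (for self-containment it copies the mask back onto the calling task itself, since we have no freshly created kthread to target):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* Read the CPU affinity mask of task 'from' and apply it to task 'to'.
 * pid 0 means the calling thread.  In the kernel patch, 'from' is
 * current and 'to' is the just-created vhost poller thread. */
static int copy_affinity(pid_t from, pid_t to)
{
    cpu_set_t mask;

    if (sched_getaffinity(from, sizeof(mask), &mask) < 0)
        return -1;
    if (sched_setaffinity(to, sizeof(mask), &mask) < 0)
        return -1;
    return 0;
}
```

The important detail the thread is wrestling with is not the call itself but symbol visibility: the kernel-side functions existed all along, they just were not EXPORT_SYMBOL_GPL'd for module use.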
[PATCH] Fix bugs in msix_set/unset_mask_notifier() routines
I am hitting the following assertions in msix.c when doing a guest
reboot or live migration using vhost.

qemu-kvm/hw/msix.c:375: msix_mask_all: Assertion `r >= 0' failed.
qemu-kvm/hw/msix.c:640: msix_unset_mask_notifier: Assertion
`dev->msix_mask_notifier_opaque[vector]' failed.

The following patch fixes the bugs in handling the msix_is_masked()
condition in the msix_set/unset_mask_notifier() routines.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/hw/msix.c b/hw/msix.c
index 1398680..a191df1 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -609,7 +609,7 @@ void msix_unuse_all_vectors(PCIDevice *dev)

 int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
 {
-    int r;
+    int r = 0;

     if (vector >= dev->msix_entries_nr || !dev->msix_entry_used[vector])
         return 0;
@@ -619,13 +619,15 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)

     /* Unmask the new notifier unless vector is masked. */
     if (msix_is_masked(dev, vector)) {
-        return 0;
+        goto out;
     }
     r = dev->msix_mask_notifier(dev, vector, opaque,
                                 msix_is_masked(dev, vector));
     if (r < 0) {
         return r;
     }
+
+out:
     dev->msix_mask_notifier_opaque[vector] = opaque;
     return r;
 }
@@ -640,8 +642,8 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned vector)
     assert(dev->msix_mask_notifier_opaque[vector]);

     /* Mask the old notifier unless it is already masked. */
-    if (msix_is_masked(dev, vector)) {
-        return 0;
+    if (!msix_is_masked(dev, vector)) {
+        goto out;
     }
     r = dev->msix_mask_notifier(dev, vector,
                                 dev->msix_mask_notifier_opaque[vector],
@@ -649,6 +651,8 @@
     if (r < 0) {
         return r;
     }
+
+out:
     dev->msix_mask_notifier_opaque[vector] = NULL;
     return r;
 }
Re: [PATCH] msix: fix msix_set/unset_mask_notifier
On Wed, 2010-06-02 at 20:49 +0300, Michael S. Tsirkin wrote:

Sridhar Samudrala reported hitting the following assertions in msix.c
when doing a guest reboot or live migration using vhost.

qemu-kvm/hw/msix.c:375: msix_mask_all: Assertion `r >= 0' failed.
qemu-kvm/hw/msix.c:640: msix_unset_mask_notifier: Assertion
`dev->msix_mask_notifier_opaque[vector]' failed.

The issue is that we didn't clear/set the opaque pointer when the
vector is masked. The following patch fixes this.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
Sridhar, could you test the following please?

Yes. Looks good and works fine for me.
Tested-by: Sridhar Samudrala s...@us.ibm.com

 hw/msix.c | 33 -
 1 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index b400769..24ff6ae 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -610,7 +610,7 @@ void msix_unuse_all_vectors(PCIDevice *dev)

 int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
 {
-    int r;
+    int r = 0;

     if (vector >= dev->msix_entries_nr || !dev->msix_entry_used[vector])
         return 0;
@@ -619,13 +619,11 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)

     assert(!dev->msix_mask_notifier_opaque[vector]);

     /* Unmask the new notifier unless vector is masked. */
-    if (msix_is_masked(dev, vector)) {
-        return 0;
-    }
-    r = dev->msix_mask_notifier(dev, vector, opaque,
-                                msix_is_masked(dev, vector));
-    if (r < 0) {
-        return r;
+    if (!msix_is_masked(dev, vector)) {
+        r = dev->msix_mask_notifier(dev, vector, opaque, false);
+        if (r < 0) {
+            return r;
+        }
     }
     dev->msix_mask_notifier_opaque[vector] = opaque;
     return r;
@@ -634,21 +632,21 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)

 int msix_unset_mask_notifier(PCIDevice *dev, unsigned vector)
 {
     int r = 0;
+    void *opaque;

     if (vector >= dev->msix_entries_nr || !dev->msix_entry_used[vector])
         return 0;

+    opaque = dev->msix_mask_notifier_opaque[vector];
+
     assert(dev->msix_mask_notifier);
-    assert(dev->msix_mask_notifier_opaque[vector]);
+    assert(opaque);

     /* Mask the old notifier unless it is already masked. */
-    if (msix_is_masked(dev, vector)) {
-        return 0;
-    }
-    r = dev->msix_mask_notifier(dev, vector,
-                                dev->msix_mask_notifier_opaque[vector],
-                                !msix_is_masked(dev, vector));
-    if (r < 0) {
-        return r;
+    if (!msix_is_masked(dev, vector)) {
+        r = dev->msix_mask_notifier(dev, vector, opaque, true);
+        if (r < 0) {
+            return r;
+        }
     }
     dev->msix_mask_notifier_opaque[vector] = NULL;
     return r;
Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers
On Tue, 2010-06-01 at 11:35 +0200, Tejun Heo wrote:

Apply the cpumask and cgroup of the initializing task to the created
vhost worker. Based on Sridhar Samudrala's patch. Li Zefan spotted a
bug in the error path (twice), fixed (twice).

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Michael S. Tsirkin m...@redhat.com
Cc: Sridhar Samudrala samudrala.srid...@gmail.com
Cc: Li Zefan l...@cn.fujitsu.com
---
 drivers/vhost/vhost.c | 34 ++
 1 file changed, 30 insertions(+), 4 deletions(-)

Index: work/drivers/vhost/vhost.c
===
--- work.orig/drivers/vhost/vhost.c
+++ work/drivers/vhost/vhost.c
@@ -23,6 +23,7 @@
 #include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
+#include <linux/cgroup.h>

 #include <linux/net.h>
 #include <linux/if_packet.h>
@@ -187,11 +188,29 @@ long vhost_dev_init(struct vhost_dev *de
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
 	struct task_struct *worker;
-	int i;
+	cpumask_var_t mask;
+	int i, ret = -ENOMEM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		goto out_free_mask;

I think this is another bug in the error path. You should simply do a
return instead of a goto here when alloc_cpumask_var() fails.

Thanks
Sridhar

 	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-	if (IS_ERR(worker))
-		return PTR_ERR(worker);
+	if (IS_ERR(worker)) {
+		ret = PTR_ERR(worker);
+		goto out_free_mask;
+	}
+
+	ret = sched_getaffinity(current->pid, mask);
+	if (ret)
+		goto out_stop_worker;
+
+	ret = sched_setaffinity(worker->pid, mask);
+	if (ret)
+		goto out_stop_worker;
+
+	ret = cgroup_attach_task_current_cg(worker);
+	if (ret)
+		goto out_stop_worker;

 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
@@ -214,7 +233,14 @@ long vhost_dev_init(struct vhost_dev *de
 	}

 	wake_up_process(worker);	/* avoid contributing to loadavg */
-	return 0;
+	ret = 0;
+	goto out_free_mask;
+
+out_stop_worker:
+	kthread_stop(worker);
+out_free_mask:
+	free_cpumask_var(mask);
+	return ret;
 }

 /* Caller should have device mutex */
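The error-path question debated above is the classic label-per-resource unwind pattern: each failure jumps to a label that releases only what was acquired *before* the failure point, and the very first acquisition fails with a plain return because nothing needs unwinding yet. A runnable userspace caricature (all names hypothetical; malloc stands in for alloc_cpumask_var/kthread_create):

```c
#include <stdlib.h>

/* 'fail_at' injects a failure at a chosen step so the unwind paths can
 * be exercised: 0 = success, 1 = first allocation fails, 2 = second
 * allocation fails. */
static int setup(int fail_at)
{
    void *mask = NULL, *worker = NULL;
    int ret = -1;

    mask = (fail_at == 1) ? NULL : malloc(32);
    if (!mask)
        return -1;           /* nothing acquired yet: plain return */

    worker = (fail_at == 2) ? NULL : malloc(64);
    if (!worker)
        goto out_free_mask;  /* only the mask needs releasing */

    ret = 0;                 /* success path */
    free(worker);
out_free_mask:
    free(mask);
    return ret;
}
```

This is exactly the shape Sridhar is asking for: the goto labels only ever release resources that were successfully acquired, so a failed allocation is never handed to its release function.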
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
On Thu, 2010-05-27 at 14:44 +0200, Oleg Nesterov wrote:

On 05/27, Michael S. Tsirkin wrote:

On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote:

Add a new kernel API to create a singlethread workqueue and attach its
task to the current task's cgroup and cpumask.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

Could someone familiar with the workqueue code please comment on
whether this patch is suitable for 2.6.35? It is needed to fix the case
where a vhost user might cause a kernel thread to consume more CPU than
allowed by the cgroup. Should I merge it through the vhost tree? Ack
for this?

I don't understand the reasons for this patch, but this doesn't matter.
I don't really see any need to change workqueue.c:

+static struct task_struct *get_singlethread_wq_task(struct workqueue_struct *wq)
+{
+	return (per_cpu_ptr(wq->cpu_wq, singlethread_cpu))->thread;
+}

(Not sure this trivial static helper with the single caller makes
sense, but see below.)

+/* Create a singlethread workqueue and attach its task to the current task's
+ * cgroup and set its cpumask to the current task's cpumask.
+ */
+struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name)
+{
+	struct workqueue_struct *wq;
+	struct task_struct *task;
+	cpumask_var_t mask;
+
+	wq = create_singlethread_workqueue(name);
+	if (!wq)
+		goto out;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		goto err;
+
+	if (sched_getaffinity(current->pid, mask))
+		goto err;
+
+	task = get_singlethread_wq_task(wq);
+	if (sched_setaffinity(task->pid, mask))
+		goto err;
+
+	if (cgroup_attach_task_current_cg(task))
+		goto err;
+out:
+	return wq;
+err:
+	destroy_workqueue(wq);
+	wq = NULL;
+	goto out;
+}

Instead, cgroup.c (or whoever needs this) can do:

struct move_struct {
	struct work_struct work;
	int ret;
};

static void move_func(struct work_struct *work)
{
	struct move_struct *move = container_of(...);

	if (cgroup_attach_task_current_cg(current))

We are trying to attach the task associated with the workqueue to the
current task's cgroup. So what we need is
cgroup_attach_task_current_cg(wq->task); however, there is no interface
currently that exports the task associated with a workqueue. It is
hidden in cpu_workqueue_struct and is only accessible within
workqueue.c.

		ret = -EANY;
}

static struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name)
{
	struct workqueue_struct *wq;
	struct move_struct move = {
		.work = __WORK_INITIALIZER(move_func);
	};

	wq = create_singlethread_workqueue(name);
	if (!wq)
		return NULL;

	queue_work(&move.work);
	flush_work(&move.work);

	if (move.ret) {
		destroy_workqueue(wq);
		wq = NULL;
	}

	return wq;
}

Or just export wq_per_cpu() from workqueue.c (probably with a better
name) and use it like the patch does.

This requires that struct cpu_workqueue_struct and struct
workqueue_struct are made externally visible by moving them to
kernel/workqueue.h. Instead, what about adding the simple helper
get_singlethread_wq_task() in workqueue.c and exporting it? I can add
create_singlethread_workqueue_in_current_cg() to cgroup.c using this
helper routine.

Thanks
Sridhar
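Oleg's trick above - instead of exporting the worker task, queue a work item and let the work function act on itself from inside the worker - can be demonstrated with a small pthread-based userspace analogue. This is a toy single-slot, single-worker queue of my own construction, not the kernel workqueue; the point it makes is only that queue+flush lets code run in the worker's own context:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*work_fn)(void *arg);

struct wq {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t cond;
    work_fn fn;            /* pending work item, NULL when idle */
    void *arg;
    bool done;             /* last queued item has completed */
    bool stop;
};

static void *worker(void *p)
{
    struct wq *q = p;

    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (!q->fn && !q->stop)
            pthread_cond_wait(&q->cond, &q->lock);
        if (q->stop)
            break;
        q->fn(q->arg);     /* the work runs in the worker's own context */
        q->fn = NULL;
        q->done = true;
        pthread_cond_broadcast(&q->cond);
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

static int wq_start(struct wq *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->cond, NULL);
    q->fn = NULL;
    q->done = false;
    q->stop = false;
    return pthread_create(&q->thread, NULL, worker, q);
}

/* queue_work() + flush_work() rolled into one: block until done. */
static void wq_run(struct wq *q, work_fn fn, void *arg)
{
    pthread_mutex_lock(&q->lock);
    q->fn = fn;
    q->arg = arg;
    q->done = false;
    pthread_cond_broadcast(&q->cond);
    while (!q->done)
        pthread_cond_wait(&q->cond, &q->lock);
    pthread_mutex_unlock(&q->lock);
}

static void wq_stop(struct wq *q)
{
    pthread_mutex_lock(&q->lock);
    q->stop = true;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->lock);
    pthread_join(q->thread, NULL);
}

/* The move_func analogue: acting on "current" from inside the worker. */
static void record_self(void *arg)
{
    *(pthread_t *)arg = pthread_self();
}
```

Running record_self through wq_run shows it executes on the worker thread, not the caller - which is why, in Oleg's kernel version, `current` inside move_func is the workqueue task even though nothing exported that task pointer. Sridhar's objection still applies: the kernel version needs the attach to target the worker, which the queued function reaches only as `current`.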
Re: [PATCH 1/3] cgroups: Add an API to attach a task to current task's cgroup
On 5/20/2010 3:22 PM, Paul Menage wrote:

On Tue, May 18, 2010 at 5:04 PM, Sridhar Samudrala
samudrala.srid...@gmail.com wrote:

Add a new kernel API to attach a task to the current task's cgroup in
all the active hierarchies.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

Reviewed-by: Paul Menage men...@google.com

It would be more efficient to just attach directly to current->cgroups
rather than potentially creating/destroying one css_set for each
hierarchy until we've completely converged on current->cgroups - but
that would require a bunch of refactoring of the guts of
cgroup_attach_task() to ensure that the right can_attach()/attach()
callbacks are made. That doesn't really seem worthwhile right now for
the initial use, which I imagine isn't going to be
performance-sensitive.

Yes. In our use-case, this will be called only once per guest interface
when the guest comes up. Hopefully you or someone more familiar with
the cgroups subsystem can optimize this function later.

Thanks
Sridhar
[PATCH 1/3] cgroups: Add an API to attach a task to current task's cgroup
Add a new kernel API to attach a task to the current task's cgroup in
all the active hierarchies.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -570,6 +570,7 @@
 struct task_struct *cgroup_iter_next(struct cgroup *cgrp,
 				     struct cgroup_iter *it);
 void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
+int cgroup_attach_task_current_cg(struct task_struct *);

 /*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6d870f2..6cfeb06 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1788,6 +1788,29 @@ out:
 	return retval;
 }

+/**
+ * cgroup_attach_task_current_cg - attach task 'tsk' to current task's cgroup
+ * @tsk: the task to be attached
+ */
+int cgroup_attach_task_current_cg(struct task_struct *tsk)
+{
+	struct cgroupfs_root *root;
+	struct cgroup *cur_cg;
+	int retval = 0;
+
+	cgroup_lock();
+	for_each_active_root(root) {
+		cur_cg = task_cgroup_from_root(current, root);
+		retval = cgroup_attach_task(cur_cg, tsk);
+		if (retval)
+			break;
+	}
+	cgroup_unlock();
+
+	return retval;
+}
+EXPORT_SYMBOL_GPL(cgroup_attach_task_current_cg);
+
 /*
  * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
  * held. May take task_lock of task
[PATCH 3/3] vhost: make it more scalable by creating a vhost thread per device
Make vhost more scalable by creating a separate vhost thread per vhost
device. This provides better scaling across virtio-net interfaces in
multiple guests. Also attach each vhost thread to the cgroup and
cpumask of the associated guest (qemu or libvirt).

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9777583..18bf5be 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -329,19 +329,22 @@ static void handle_rx_net(struct work_struct *work)
 static int vhost_net_open(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
+	struct vhost_dev *dev;
 	int r;

 	if (!n)
 		return -ENOMEM;
+
+	dev = &n->dev;
 	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
 	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
+	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
 	}

-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;

 	f->private_data = n;
@@ -644,25 +647,14 @@ static struct miscdevice vhost_net_misc = {

 int vhost_net_init(void)
 {
-	int r = vhost_init();
-	if (r)
-		goto err_init;
-	r = misc_register(&vhost_net_misc);
-	if (r)
-		goto err_reg;
-	return 0;
-err_reg:
-	vhost_cleanup();
-err_init:
-	return r;
-
+	return misc_register(&vhost_net_misc);
 }
+
 module_init(vhost_net_init);

 void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
-	vhost_cleanup();
 }
 module_exit(vhost_net_exit);

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 49fa953..b076b9b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -37,8 +37,6 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };

-static struct workqueue_struct *vhost_workqueue;
-
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
@@ -57,18 +55,19 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	if (!((unsigned long)key & poll->mask))
 		return 0;
-	queue_work(vhost_workqueue, &poll->work);
+	queue_work(poll->dev->wq, &poll->work);
 	return 0;
 }

 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask)
+		     unsigned long mask, struct vhost_dev *dev)
 {
 	INIT_WORK(&poll->work, func);
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
+	poll->dev = dev;
 }

 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -97,7 +96,7 @@ void vhost_poll_flush(struct vhost_poll *poll)

 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	queue_work(vhost_workqueue, &poll->work);
+	queue_work(poll->dev->wq, &poll->work);
 }

 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -129,6 +128,13 @@ long vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
 	int i;
+	char vhost_name[20];
+
+	snprintf(vhost_name, 20, "vhost-%d", current->pid);
+	dev->wq = create_singlethread_workqueue_in_current_cg(vhost_name);
+	if (!dev->wq)
+		return -ENOMEM;
+
 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
 	mutex_init(&dev->mutex);
@@ -144,7 +150,7 @@ long vhost_dev_init(struct vhost_dev *dev,
 		if (dev->vqs[i].handle_kick)
 			vhost_poll_init(&dev->vqs[i].poll,
 					dev->vqs[i].handle_kick,
-					POLLIN);
+					POLLIN, dev);
 	}
 	return 0;
 }
@@ -217,6 +223,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
+
+	destroy_workqueue(dev->wq);
 }

 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -1105,16 +1113,3 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
 		vq_err(vq, "Failed to enable notification at %p: %d\n",
 		       &vq->used->flags, r);
 }
-
-int vhost_init(void)
-{
-	vhost_workqueue = create_singlethread_workqueue("vhost");
-	if (!vhost_workqueue)
-		return -ENOMEM;
-	return 0;
-}
-
-void vhost_cleanup(void)
-{
-	destroy_workqueue(vhost_workqueue);
-}
diff --git a/drivers/vhost/vhost.h
Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.
On Sun, 2010-04-11 at 18:47 +0300, Michael S. Tsirkin wrote:

On Thu, Apr 08, 2010 at 05:05:42PM -0700, Sridhar Samudrala wrote:

On Mon, 2010-04-05 at 10:35 -0700, Sridhar Samudrala wrote:

On Sun, 2010-04-04 at 14:14 +0300, Michael S. Tsirkin wrote:

On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote:

Make vhost scalable by creating a separate vhost thread per vhost
device. This provides better scaling across multiple guests and with
multiple interfaces in a guest.

Thanks for looking into this. An alternative approach is to simply
replace create_singlethread_workqueue with create_workqueue, which
would get us a thread per host CPU. It seems that in theory this should
be the optimal approach wrt CPU locality; however, in practice a single
thread seems to get better numbers. I have a TODO to investigate this.
Could you try looking into this?

Yes. I tried using create_workqueue(), but the results were not good,
at least when the number of guest interfaces is less than the number of
CPUs. I didn't try more than 8 guests. Creating a separate thread per
guest interface seems to be more scalable based on the testing I have
done so far. I will try some more tests and get some numbers to compare
the following 3 options:
- single vhost thread
- vhost thread per cpu
- vhost thread per guest virtio interface

Here are the results with netperf TCP_STREAM 64K guest-to-host on an
8-cpu Nehalem system. It shows cumulative bandwidth in Mbps and host
CPU utilization.
Current default: single vhost thread
---
1 guest:  12500  37%
2 guests: 12800  46%
3 guests: 12600  47%
4 guests: 12200  47%
5 guests: 12000  47%
6 guests: 11700  47%
7 guests: 11340  47%
8 guests: 11200  48%

vhost thread per cpu
---
1 guest:   4900  25%
2 guests: 10800  49%
3 guests: 17100  67%
4 guests: 20400  84%
5 guests: 21000  90%
6 guests: 22500  92%
7 guests: 23500  96%
8 guests: 24500  99%

vhost thread per guest interface
---
1 guest:  12500  37%
2 guests: 21000  72%
3 guests: 21600  79%
4 guests: 21600  85%
5 guests: 22500  89%
6 guests: 22800  94%
7 guests: 24500  98%
8 guests: 26400  99%

Thanks
Sridhar

Consider using Ingo's perf tool to get error bars, but looks good
overall.

What do you mean by getting error bars?

One thing I note though is that we seem to be able to consume up to 99%
CPU now. So I think with this approach we can no longer claim that we
are just like some other parts of the networking stack, doing work
outside any cgroup, and we should make the vhost thread inherit the
cgroup and cpu mask from the process calling SET_OWNER.

Yes. I am not sure what the right interface to do this is, but this
should also allow binding qemu to a set of cpus and automatically
having the vhost thread inherit the same cpu mask.

Thanks
Sridhar
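One way to read the 8-guest rows of the numbers above is bandwidth per percentage point of host CPU, a rough efficiency figure. This little derivation is mine, not from the original mail, and the metric is admittedly crude (it ignores latency and per-guest fairness):

```c
/* Aggregate Mbps divided by host CPU utilization in percent.
 * 8-guest rows from the table: single thread 11200 Mbps at 48%,
 * thread per cpu 24500 Mbps at 99%, thread per interface 26400 at 99%. */
static double mbps_per_cpu(double mbps, double cpu_pct)
{
    return mbps / cpu_pct;
}
```

By this measure the per-interface configuration is not just the highest aggregate throughput; at 8 guests it also delivers more Mbps per CPU point (26400/99) than either the per-cpu (24500/99) or the single-thread (11200/48) configuration, which supports the choice the thread converges on.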
Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.
On Thu, 2010-04-08 at 17:14 -0700, Rick Jones wrote:

Here are the results with netperf TCP_STREAM 64K guest to host on a
8-cpu Nehalem system.

I presume you mean 8-core Nehalem-EP, or did you mean 8-processor
Nehalem-EX?

Yes. It is a 2-socket quad-core Nehalem, so I guess it is an 8-core
Nehalem-EP.

Don't get me wrong, I *like* the netperf 64K TCP_STREAM test, I like it
a lot!-) But I find it incomplete, and also like to run things like
single-instance TCP_RR and multiple-instance, multiple-transaction
(./configure --enable-burst) TCP_RR tests, particularly when concerned
with scaling issues.

Can we run multiple-instance and multiple-transaction tests with a
single netperf command line? Is there any easy way to get consolidated
throughput when a netserver on the host is servicing netperf clients
from multiple guests?

Thanks
Sridhar

happy benchmarking,

rick jones

It shows cumulative bandwidth in Mbps and host CPU utilization.

Current default: single vhost thread
---
1 guest:  12500  37%
2 guests: 12800  46%
3 guests: 12600  47%
4 guests: 12200  47%
5 guests: 12000  47%
6 guests: 11700  47%
7 guests: 11340  47%
8 guests: 11200  48%

vhost thread per cpu
---
1 guest:   4900  25%
2 guests: 10800  49%
3 guests: 17100  67%
4 guests: 20400  84%
5 guests: 21000  90%
6 guests: 22500  92%
7 guests: 23500  96%
8 guests: 24500  99%

vhost thread per guest interface
---
1 guest:  12500  37%
2 guests: 21000  72%
3 guests: 21600  79%
4 guests: 21600  85%
5 guests: 22500  89%
6 guests: 22800  94%
7 guests: 24500  98%
8 guests: 26400  99%

Thanks
Sridhar
Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.
On Mon, 2010-04-05 at 10:35 -0700, Sridhar Samudrala wrote:

On Sun, 2010-04-04 at 14:14 +0300, Michael S. Tsirkin wrote:

On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote:

Make vhost scalable by creating a separate vhost thread per vhost
device. This provides better scaling across multiple guests and with
multiple interfaces in a guest.

Thanks for looking into this. An alternative approach is to simply
replace create_singlethread_workqueue with create_workqueue, which
would get us a thread per host CPU. It seems that in theory this should
be the optimal approach wrt CPU locality; however, in practice a single
thread seems to get better numbers. I have a TODO to investigate this.
Could you try looking into this?

Yes. I tried using create_workqueue(), but the results were not good,
at least when the number of guest interfaces is less than the number of
CPUs. I didn't try more than 8 guests. Creating a separate thread per
guest interface seems to be more scalable based on the testing I have
done so far. I will try some more tests and get some numbers to compare
the following 3 options:
- single vhost thread
- vhost thread per cpu
- vhost thread per guest virtio interface

Here are the results with netperf TCP_STREAM 64K guest-to-host on an
8-cpu Nehalem system. It shows cumulative bandwidth in Mbps and host
CPU utilization.
Current default, single vhost thread:
1 guest:  12500  37%
2 guests: 12800  46%
3 guests: 12600  47%
4 guests: 12200  47%
5 guests: 12000  47%
6 guests: 11700  47%
7 guests: 11340  47%
8 guests: 11200  48%

vhost thread per cpu:
1 guest:   4900  25%
2 guests: 10800  49%
3 guests: 17100  67%
4 guests: 20400  84%
5 guests: 21000  90%
6 guests: 22500  92%
7 guests: 23500  96%
8 guests: 24500  99%

vhost thread per guest interface:
1 guest:  12500  37%
2 guests: 21000  72%
3 guests: 21600  79%
4 guests: 21600  85%
5 guests: 22500  89%
6 guests: 22800  94%
7 guests: 24500  98%
8 guests: 26400  99%

Thanks
Sridhar
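Because the three configurations above saturate different amounts of host CPU, it can be easier to compare them by bandwidth per CPU percent. A small sketch over the 8-guest data points from the tables (the short labels are mine; the numbers are taken from the tables above):

```shell
# Normalize the 8-guest results: Mbps divided by host CPU utilization.
# Data is inlined from the tables above; labels are shorthand, not from
# the original post.
awk 'BEGIN {
    print "config            Mbps  cpu%  Mbps/cpu%"
}
{ printf "%-16s %6d  %3d%%  %8.1f\n", $1, $2, $3, $2 / $3 }' <<'EOF'
single-thread 11200 48
per-cpu 24500 99
per-interface 26400 99
EOF
```

On this crude metric the per-interface configuration is both the highest-throughput and the most efficient of the three at 8 guests, which matches the conclusion drawn in the thread.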
Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.
On Sun, 2010-04-04 at 14:14 +0300, Michael S. Tsirkin wrote:

On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote: Make vhost scalable by creating a separate vhost thread per vhost device. This provides better scaling across multiple guests and with multiple interfaces in a guest.

Thanks for looking into this. An alternative approach is to simply replace create_singlethread_workqueue with create_workqueue, which would get us a thread per host CPU. It seems that in theory this should be the optimal approach wrt CPU locality; however, in practice a single thread seems to get better numbers. I have a TODO to investigate this. Could you try looking into this?

Yes. I tried using create_workqueue(), but the results were not good, at least when the number of guest interfaces is less than the number of CPUs. I didn't try more than 8 guests. Creating a separate thread per guest interface seems to be more scalable based on the testing I have done so far.

I will try some more tests and get some numbers to compare the following 3 options:
- single vhost thread
- vhost thread per cpu
- vhost thread per guest virtio interface

Thanks
Sridhar

I am seeing better aggregated throughput/latency when running netperf across multiple guests or multiple interfaces in a guest in parallel with this patch.

Any numbers? What happens to CPU utilization?
[PATCH] vhost: Make it more scalable by creating a vhost thread per device.
Make vhost scalable by creating a separate vhost thread per vhost device. This provides better scaling across multiple guests and with multiple interfaces in a guest. I am seeing better aggregated throughput/latency when running netperf across multiple guests or multiple interfaces in a guest in parallel with this patch.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index a6a88df..29aa80f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -339,8 +339,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		return r;
 	}
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+			n->dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+			n->dev);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
 
 	f->private_data = n;
@@ -643,25 +645,14 @@ static struct miscdevice vhost_net_misc = {
 
 int vhost_net_init(void)
 {
-	int r = vhost_init();
-	if (r)
-		goto err_init;
-	r = misc_register(&vhost_net_misc);
-	if (r)
-		goto err_reg;
-	return 0;
-err_reg:
-	vhost_cleanup();
-err_init:
-	return r;
-
+	return misc_register(&vhost_net_misc);
 }
+
 module_init(vhost_net_init);
 
 void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
-	vhost_cleanup();
 }
 module_exit(vhost_net_exit);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7bd7a1e..243f4d3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -36,8 +36,6 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
-static struct workqueue_struct *vhost_workqueue;
-
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
@@ -56,18 +54,19 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	if (!((unsigned long)key & poll->mask))
 		return 0;
 
-	queue_work(vhost_workqueue, &poll->work);
+	queue_work(poll->dev->wq, &poll->work);
 	return 0;
 }
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask)
+		     unsigned long mask, struct vhost_dev *dev)
 {
 	INIT_WORK(&poll->work, func);
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
+	poll->dev = dev;
 }
 
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -96,7 +95,7 @@ void vhost_poll_flush(struct vhost_poll *poll)
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	queue_work(vhost_workqueue, &poll->work);
+	queue_work(poll->dev->wq, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -128,6 +127,11 @@ long vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
 	int i;
+
+	dev->wq = create_singlethread_workqueue("vhost");
+	if (!dev->wq)
+		return -ENOMEM;
+
 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
 	mutex_init(&dev->mutex);
@@ -143,7 +147,7 @@ long vhost_dev_init(struct vhost_dev *dev,
 		if (dev->vqs[i].handle_kick)
 			vhost_poll_init(&dev->vqs[i].poll,
 					dev->vqs[i].handle_kick,
-					POLLIN);
+					POLLIN, dev);
 	}
 	return 0;
 }
@@ -216,6 +220,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
+
+	destroy_workqueue(dev->wq);
 }
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -1095,16 +1101,3 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
 		vq_err(vq, "Failed to enable notification at %p: %d\n",
 		       &vq->used->flags, r);
 }
-
-int vhost_init(void)
-{
-	vhost_workqueue = create_singlethread_workqueue("vhost");
-	if (!vhost_workqueue)
-		return -ENOMEM;
-	return 0;
-}
-
-void vhost_cleanup(void)
-{
-	destroy_workqueue(vhost_workqueue);
-}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 44591ba..60fefd0 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -29,10 +29,11 @@ struct vhost_poll {
 	/* struct which will handle all actual work. */
 	struct work_struct	work;
 	unsigned long		mask;
+	struct vhost_dev	*dev;
 };
 
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask);
+		     unsigned long mask, struct vhost_dev *dev
Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
On Fri, 2010-04-02 at 15:25 +0800, xiaohui@intel.com wrote:

The idea is simple: pin the guest VM user space and then let the host NIC driver have the chance to DMA to it directly. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

What is the advantage of this approach compared to PCI passthrough of the host NIC to the guest? Does this require pinning of the entire guest memory, or only the send/receive buffers?

Thanks
Sridhar

The scenario is like this: the guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel, and the requests are queued and then completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked, which means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w.

Here, we have considered 2 ways to utilize the page constructor API to dispense the user buffers.

One: Modify the __alloc_skb() function a bit so it can allocate just a struct sk_buff, with the data pointer pointing to a user buffer which comes from a page constructor API. Then the shinfo of the skb is also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. What we have done is in this way.

Pros: We can avoid any copy here.
Cons: The guest virtio-net driver needs to allocate the skb in almost the same way as the host NIC drivers do, say the size used by netdev_alloc_skb() and the same reserved space in the head of the skb. Many NIC drivers do the same as the guest, which is fine. But some of the latest NIC drivers reserve special room in the skb head. To deal with it, we suggest providing a method in the guest virtio-net driver to ask for the parameters we are interested in from the NIC driver once we know which device we have bound for zero-copy, and then have the guest follow them. Is that reasonable?

Two: Modify the driver to get user buffers allocated from a page constructor API (to substitute alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer side, let the host allocate the skb and h/w fill it. After that, the data filled into the host skb header will be copied into the guest header buffer, which is submitted together with the payload buffer.

Pros: We care less about the way the guest or host allocates their buffers.

Cons: We still need a small copy here for the skb header.

We are not sure which way is better here. This is the first thing we want comments on from the community. We wish the modification to the network part to be generic, not used by the vhost-net backend only; a user application may use it as well once the zero-copy device provides async read/write operations later. Please give comments, especially on the network part modifications.

We provide multiple submits and asynchronous notification to vhost-net too. Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later, but in a simple test with netperf we found bandwidth up and CPU % up too, though the bandwidth increase ratio is much larger than the CPU % increase ratio.
What we have not done yet:
- packet split support
- GRO support
- performance tuning

What we have done in v1:
- polish the RCU usage
- deal with write logging in asynchronous mode in vhost
- add a notifier block for the mp device
- rename page_ctor to mp_port in netdevice.h to make it look generic
- add mp_dev_change_flags() for the mp device to change NIC state
- add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
- a small fix for a missing dev_put on failure
- use a dynamic minor instead of a static minor number
- a __KERNEL__ protect for mp_get_sock()

What we have done in v2:
- remove most of the RCU usage, since the ctor pointer is only changed by the BIND/UNBIND ioctl, and during that time the NIC will be stopped to get a good cleanup (all outstanding requests are finished), so the
Re: [PATCH] kvm: Increase NR_IOBUS_DEVS limit to 200
On Wed, 2010-03-31 at 12:51 +0300, Michael S. Tsirkin wrote:

On Tue, Mar 30, 2010 at 04:48:25PM -0700, Sridhar Samudrala wrote: This patch increases the current hardcoded limit of NR_IOBUS_DEVS from 6 to 200. We are hitting this limit when creating a guest with more than 1 virtio-net device using the vhost-net backend. Each virtio-net device requires 2 such devices to service notifications from the rx/tx queues.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

I tried this, but observed a measurable performance degradation with vhost net and this patch. Did not investigate yet. Do you see this as well?

No. I am not seeing any degradation that is not within the range of variation seen with kvm networking. For example, on a Nehalem 2-socket, 8-core system, running netperf TCP_STREAM with 64K size packets, I am getting 12-13Gb/s guest to host and 11-12Gb/s host to guest with and without this patch.

Thanks
Sridhar

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a3fd0f9..7fb48d3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -54,7 +54,7 @@ extern struct kmem_cache *kvm_vcpu_cache;
  */
 struct kvm_io_bus {
 	int dev_count;
-#define NR_IOBUS_DEVS 6
+#define NR_IOBUS_DEVS 200
 	struct kvm_io_device *devs[NR_IOBUS_DEVS];
 };
[PATCH] kvm: Increase NR_IOBUS_DEVS limit to 200
This patch increases the current hardcoded limit of NR_IOBUS_DEVS from 6 to 200. We are hitting this limit when creating a guest with more than 1 virtio-net device using the vhost-net backend. Each virtio-net device requires 2 such devices to service notifications from the rx/tx queues.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a3fd0f9..7fb48d3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -54,7 +54,7 @@ extern struct kmem_cache *kvm_vcpu_cache;
  */
 struct kvm_io_bus {
 	int dev_count;
-#define NR_IOBUS_DEVS 6
+#define NR_IOBUS_DEVS 200
 	struct kvm_io_device *devs[NR_IOBUS_DEVS];
 };
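As a back-of-the-envelope check of why the old limit is hit: with two kvm_io_bus slots consumed per vhost-backed virtio-net device (one for each of the rx and tx queue notifications, per the patch description), a 6-slot bus tops out at 3 such NICs even before counting any other in-kernel devices. A hedged sketch of the arithmetic:

```shell
# Rough capacity estimate, ignoring slots taken by other in-kernel
# devices: each vhost-backed virtio-net NIC needs 2 io-bus slots.
slots_per_nic=2
max_nics() {
    echo $(( $1 / slots_per_nic ))
}
echo "limit 6:   at most $(max_nics 6) virtio-net devices"
echo "limit 200: at most $(max_nics 200) virtio-net devices"
```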
Re: Unable to create more than 1 guest virtio-net device using vhost-net backend
On Mon, 2010-03-22 at 20:16 +0200, Michael S. Tsirkin wrote: On Sun, Mar 21, 2010 at 01:58:29PM +0200, Avi Kivity wrote: On 03/21/2010 01:34 PM, Michael S. Tsirkin wrote: On Sun, Mar 21, 2010 at 12:29:31PM +0200, Avi Kivity wrote: On 03/21/2010 12:15 PM, Michael S. Tsirkin wrote:

Nothing easy that I can see. Each device needs 2 of these. Avi, Gleb, any objections to increasing the limit to say 16? That would give us 5 more devices over the current limit of 6 per guest.

Increase it to 200, then.

OK. I think we'll also need a smarter allocator than the bus->dev_count++ we have now. Right?

No, why?

We'll run into problems if devices are created/removed in random order, won't we?

unregister_dev() takes care of it. Eventually we'll want faster scanning than the linear search we employ now, though.

Yes, I suspect with 200 entries we will :). Let's just make it 16 for now?

Let's make it 200 and fix the performance problems later. Making it 16 is just asking for trouble.

I did this and performance with vhost seems to become much more noisy and drop by about 10% on average, even though in practice only a single device is created. Still trying to figure it out... Any idea?

I am not sure if this 10% variation is due to the increase of NR_IOBUS_DEVS. In our testing, we do see variations of 10% or higher between multiple netperf instances with the same setup/configuration when using virtio/vhost.

Thanks
Sridhar
Re: vlan disable TSO of virtio in KVM
On Fri, 2010-02-26 at 10:51 +0800, David V. Cloud wrote:

Hi, I read some kernel source. My basic understanding is that, in net/8021q/vlan_dev.c, vlan_dev_init(), the dev->features of a vconfig-created interface is defined to be

dev->features |= real_dev->features & real_dev->vlan_features;

However, in drivers/net/virtio_net.c, vlan_features is never set (I will assume it would be 0). So, dev->features will be 0 for the ethX.vid interface. I verified it using ethtool -k on each KVM:

# ethtool -k eth0
shows that rx/tx csum, sg, tso, gso are on
# ethtool -k eth0.3003
all offloading features are off

I think that is why TSO is never enabled when running large-packet traffic between two vlan interfaces of different KVMs. I also took a look at VMware's pv implementation, drivers/net/vmxnet3/vmxnet3_drv.c; they enable dev->vlan_features when probing, by

netdev->vlan_features = netdev->features;

I was wondering why vlan_features is not defined in virtio_net. Is it a BUG? Or is it due to some constraints? Could anyone explain that?

I saw the same issue some time back and submitted a couple of patches to address it, but they were not accepted as the fix is not done at the right place. Not sure if we can do this right without updating virtio_net_hdr with vlan-specific info.

http://thread.gmane.org/gmane.linux.network/150197/focus=150838
http://thread.gmane.org/gmane.linux.network/150198/focus=150837

Thanks
Sridhar

Thanks,
-D

On Thu, Feb 25, 2010 at 6:30 PM, David V. Cloud david.v.cl...@gmail.com wrote:

Hi all, I have been deploying two KVMs on my Debian testing box. The two KVMs each use one tap device connecting to the host.
When doing netperf with a large packet size from KVM2 (tap1) to KVM1 (tap0) using ethX on them, I could verify that TSO did happen by

# tcpdump -nt -i tap1

I can see messages like:

IP 192.168.101.2.39994 > 192.168.101.1.41316: Flags [P.], seq 7912865:7918657, ack 0, win 92, options [nop,nop,TS val 874151 ecr 874803], length 5792

So, according to the 'length', the skb didn't get segmented.

However, when I (1) set up a VLAN using vconfig on KVM2 and KVM1, getting two new interfaces eth1.3003 and eth0.3003 on the two machines, and (2) ran netperf between the two new interfaces, TSO no longer showed up:

# tcpdump -nt -i tap1

I only got:

vlan 3003, p 0, IP 10.214.10.2.42324 > 10.214.10.1.56460: Flags [P.], seq 2127976:2129424, ack 1, win 92, options [nop,nop,TS val 926034 ecr 926686], length 1448

So, all the large packets get segmented in virtio (is that right?)

My KVM command line options are:

kvm -hda $IMG -m 768 -net nic,model=virtio,macaddr=52:54:00:12:34:56 -net tap,ifname=$TAP,script=no

My question is whether this is the expected behavior. Can VLAN tagging coexist with TSO in the virtio_net driver? If this is not the desired result, any hint for fixing the problem?

Thanks
-D
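The ethtool observations above can be compared mechanically. The helper below lists which offload flags an interface reports as off; the ethtool -k output is canned here to mirror the symptom described (a live check would pipe the output of ethtool -k eth0 and ethtool -k eth0.3003 into it), and the exact flag names vary between ethtool versions.

```shell
# List the offload flags reported "off" in ethtool -k style output.
list_off() {
    awk -F': ' '$2 == "off" { print $1 }'
}

# Canned output approximating the two interfaces from the report above.
parent_off=$(list_off <<'EOF'
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
EOF
)
vlan_off=$(list_off <<'EOF'
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
generic-segmentation-offload: off
EOF
)
echo "off on eth0:      ${parent_off:-none}"
echo "off on eth0.3003: $vlan_off"
```

Any flag that is on for the parent but off for the VLAN device points at an empty vlan_features mask, as the analysis above suggests.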
Re: virtio GSO makes IPv6 unbearably slow
On 2/1/2010 11:50 AM, Bernhard Schmidt wrote:

Hi, I have a really weird issue all of a sudden on _one_ of my two KVM hosts. The other one, while running on different hardware in a different network, is configured in a very similar way and does not show these issues (so far).

Host:
- AMD Athlon64 3500+
- Debian testing amd64
- Kernel 2.6.32-trunk from Debian
- Debian qemu-kvm 0.11.1+dfsg-1

The system has a routed uplink (Intel e100) and an internal bridge that connects all the VMs (only virtual ports). Networking in the VMs is usually configured like this:

-net nic,model=virtio,macaddr=00:16:3E:7C:30:AA -net tap,ifname=vm.compile,script=no

The guests are Debian amd64, either testing or stable, running a custom stripped 2.6.31.6 or 2.6.32.7. I've also tested older kernels (2.6.28.2 and 2.6.30.4) and have seen the same problem.

Problem: IPv6 tx throughput from a guest to a host on the internet is extremely low (around 16kB/s). IPv6 rx is fine, as is throughput over IPv4 (both ways). Also fine is throughput from the guest to the host and from the host to the internet (also both ways).
Running tcpdump on the tap device on the host while doing an scp to another system shows this:

20:43:27.130438 IP6 GUEST.59864 > DEST.22: Flags [.], seq 70785:73641, ack 2128, win 327, options [nop,nop,TS val 4294920148 ecr 63373335], length 2856
20:43:27.130506 IP6 HOST > GUEST: ICMP6, packet too big, mtu 1500, length 1240
20:43:27.131965 IP6 DEST.22 > GUEST.59864: Flags [.], ack 69357, win 501, options [nop,nop,TS val 63373335 ecr 4294920144], length 0
20:43:27.131996 IP6 DEST.22 > GUEST.59864: Flags [.], ack 70785, win 501, options [nop,nop,TS val 63373335 ecr 4294920144], length 0
20:43:27.132651 IP6 GUEST.59864 > DEST.22: Flags [.], seq 73641:76497, ack 2128, win 327, options [nop,nop,TS val 4294920148 ecr 63373335], length 2856
20:43:27.132704 IP6 HOST > GUEST: ICMP6, packet too big, mtu 1500, length 1240
20:43:27.346347 IP6 GUEST.59864 > DEST.22: Flags [.], seq 70785:72213, ack 2128, win 327, options [nop,nop,TS val 4294920202 ecr 63373335], length 1428
20:43:27.360411 IP6 DEST.22 > GUEST.59864: Flags [.], ack 72213, win 501, options [nop,nop,TS val 63373358 ecr 4294920202], length 0
20:43:27.361045 IP6 GUEST.59864 > DEST.22: Flags [.], seq 76497:79353, ack 2128, win 327, options [nop,nop,TS val 4294920205 ecr 63373358], length 2856

So, the guest sends packets of 2856 bytes, which are too large to pass through the eth0 of the host (1500 bytes). Thus, the host rejects the packet and sends back an ICMPv6 "packet too big", which sometimes makes the guest send at least one packet with the small MTU, but sometimes it doesn't. This is repeated over and over and slows down the transfer.

According to ethtool, GSO is enabled on the guest, but disabling it using ethtool does not have any effect. However, giving the kernel virtio_net.gso=0 in append fixes the issue completely. Anyone having any ideas?

I ran into a similar problem when using the Intel e1000e driver on the host. Found a bug in most of the Intel ethernet drivers when handling large GSO IPv6 packets.
You can see the fix I submitted here:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8e1e8a4779cb23c1d9f51e9223795e07ec54d77a

But you seem to be using the Intel e100 driver, so I am not sure you are running into the same issue. You could try disabling both tso and gso on the guest using ethtool. On the host, do you have gso/tso enabled on the uplink e100 driver?

Thanks
Sridhar
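The tell-tale sign in the trace above is TCP segments whose length can never fit the 1500-byte path MTU; those only make sense as GSO super-packets the host was expected to segment. A small sketch that flags them in a capture (the sample lines are condensed from the trace quoted above):

```shell
# Flag tcpdump lines whose TCP payload length exceeds the path MTU --
# on a 1500-byte path, any such segment must have been emitted as a GSO
# super-packet and relied on somebody downstream to resegment it.
find_oversized() {
    awk -v mtu=1500 '/length [0-9]+$/ && $NF + 0 > mtu {
        printf "GSO-sized segment: %s bytes\n", $NF
    }'
}

# Canned lines condensed from the trace quoted above:
find_oversized <<'EOF'
IP6 GUEST.59864 > DEST.22: Flags [.], seq 70785:73641, length 2856
IP6 GUEST.59864 > DEST.22: Flags [.], seq 70785:72213, length 1428
EOF
```

Run against a full capture (tcpdump -nt -i tap1 | find_oversized), a stream of such lines interleaved with "packet too big" replies matches the failure mode diagnosed here.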
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wed, 2010-01-27 at 22:39 +0100, Arnd Bergmann wrote: On Wednesday 27 January 2010, Anthony Liguori wrote:

I think -net socket,fd should just be (trivially) extended to work with raw sockets out of the box, with no support for opening it. Then you can have libvirt or some wrapper open a raw socket and a private namespace and just pass it down. That'd work. Anthony?

The fundamental problem that I have with all of this is that we should not be introducing new network backends that are based around something only a developer is going to understand. If I'm a user and I want to use an external switch in VEPA mode, how in the world am I going to know that I'm supposed to use the -net raw backend or the -net socket backend? It might as well be the -net butterflies backend as far as a user is concerned.

My point is that we already have -net socket,fd, and any user that passes an fd into that already knows what he wants to do with it. Making it work with raw sockets is just a natural extension to this, which works on all kernels and (with separate namespaces) is reasonably secure.

Didn't realize that -net socket is already there and supports TCP and UDP sockets. I will look into extending -net socket to support AF_PACKET SOCK_RAW type sockets.

Thanks
Sridhar
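For illustration, the two invocation styles under discussion might look like the sketch below. The raw backend option names are taken from the patch that follows; the fd-passing form is the proposal being discussed, assuming a privileged wrapper that opens an AF_PACKET socket bound to the interface and leaves it open as fd 3 for qemu. Neither line is a documented, final interface.

```shell
# Sketch only, not a finalized interface.

# Style 1: qemu opens the raw socket itself (option names from the
# patch below; requires qemu to run with enough privilege):
qemu-kvm -hda guest.img \
    -net nic,model=virtio,macaddr=52:54:00:12:34:56 \
    -net raw,ifname=eth0,vnet_hdr=1

# Style 2 (the proposal): a privileged wrapper opens the AF_PACKET
# socket and passes the already-open descriptor to an unprivileged qemu:
qemu-kvm -hda guest.img \
    -net nic,model=virtio,macaddr=52:54:00:12:34:56 \
    -net socket,fd=3
```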
[PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
This patch adds a raw socket backend to qemu and is based on Or Gerlitz's patch, re-factored and ported to the latest qemu-kvm git tree. It also includes support for the vnet_hdr option that enables gso/checksum offload with the raw backend. You can find the linux kernel patch to support this feature here:

http://thread.gmane.org/gmane.linux.network/150308

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/Makefile.objs b/Makefile.objs
index 357d305..4468124 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -34,6 +34,8 @@ net-nested-$(CONFIG_SOLARIS) += tap-solaris.o
 net-nested-$(CONFIG_AIX) += tap-aix.o
 net-nested-$(CONFIG_SLIRP) += slirp.o
 net-nested-$(CONFIG_VDE) += vde.o
+net-nested-$(CONFIG_POSIX) += raw.o
+net-nested-$(CONFIG_LINUX) += raw-linux.o
 net-obj-y += $(addprefix net/, $(net-nested-y))
 
 ##
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index eba578a..4aa40f2 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -15,6 +15,7 @@
 #include "net.h"
 #include "net/checksum.h"
 #include "net/tap.h"
+#include "net/raw.h"
 #include "qemu-timer.h"
 #include "virtio-net.h"
@@ -133,6 +134,9 @@ static int peer_has_vnet_hdr(VirtIONet *n)
     case NET_CLIENT_TYPE_TAP:
         n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->nc.peer);
         break;
+    case NET_CLIENT_TYPE_RAW:
+        n->has_vnet_hdr = raw_has_vnet_hdr(n->nic->nc.peer);
+        break;
     default:
         return 0;
     }
@@ -149,6 +153,9 @@ static int peer_has_ufo(VirtIONet *n)
     case NET_CLIENT_TYPE_TAP:
         n->has_ufo = tap_has_ufo(n->nic->nc.peer);
         break;
+    case NET_CLIENT_TYPE_RAW:
+        n->has_ufo = raw_has_ufo(n->nic->nc.peer);
+        break;
     default:
         return 0;
     }
@@ -165,6 +172,9 @@ static void peer_using_vnet_hdr(VirtIONet *n, int using_vnet_hdr)
     case NET_CLIENT_TYPE_TAP:
         tap_using_vnet_hdr(n->nic->nc.peer, using_vnet_hdr);
         break;
+    case NET_CLIENT_TYPE_RAW:
+        raw_using_vnet_hdr(n->nic->nc.peer, using_vnet_hdr);
+        break;
     default:
         break;
     }
@@ -180,6 +190,9 @@ static void peer_set_offload(VirtIONet *n, int csum, int tso4, int tso6,
     case NET_CLIENT_TYPE_TAP:
         tap_set_offload(n->nic->nc.peer, csum, tso4, tso6, ecn, ufo);
         break;
+    case NET_CLIENT_TYPE_RAW:
+        raw_set_offload(n->nic->nc.peer, csum, tso4, tso6, ecn, ufo);
+        break;
     default:
         break;
     }
diff --git a/net.c b/net.c
index 6ef93e6..1ca2415 100644
--- a/net.c
+++ b/net.c
@@ -26,6 +26,7 @@
 #include "config-host.h"
 
 #include "net/tap.h"
+#include "net/raw.h"
 #include "net/socket.h"
 #include "net/dump.h"
 #include "net/slirp.h"
@@ -1004,6 +1005,27 @@
     },
     { /* end of list */ }
 },
+}, {
+    .type = "raw",
+    .init = net_init_raw,
+    .desc = {
+        NET_COMMON_PARAMS_DESC,
+        {
+            .name = "fd",
+            .type = QEMU_OPT_STRING,
+            .help = "file descriptor of an already opened raw socket",
+        }, {
+            .name = "ifname",
+            .type = QEMU_OPT_STRING,
+            .help = "interface name",
+        }, {
+            .name = "vnet_hdr",
+            .type = QEMU_OPT_BOOL,
+            .help = "enable PACKET_VNET_HDR option on the raw interface",
+        },
+        { /* end of list */ }
+    },
+
 #ifdef CONFIG_VDE
 }, {
     .type = "vde",
@@ -1076,6 +1098,7 @@ int net_client_init(Monitor *mon, QemuOpts *opts, int is_netdev)
 #ifdef CONFIG_VDE
         strcmp(type, "vde") != 0 &&
 #endif
+        strcmp(type, "raw") != 0 &&
         strcmp(type, "socket") != 0) {
         qemu_error("The '%s' network backend type is not valid with -netdev\n",
                    type);
diff --git a/net.h b/net.h
index 116bb80..4722185 100644
--- a/net.h
+++ b/net.h
@@ -34,7 +34,8 @@ typedef enum {
     NET_CLIENT_TYPE_TAP,
     NET_CLIENT_TYPE_SOCKET,
     NET_CLIENT_TYPE_VDE,
-    NET_CLIENT_TYPE_DUMP
+    NET_CLIENT_TYPE_DUMP,
+    NET_CLIENT_TYPE_RAW,
 } net_client_type;
 
 typedef void (NetPoll)(VLANClientState *, bool enable);
diff --git a/net/raw-linux.c b/net/raw-linux.c
new file mode 100644
index 000..9ed2e6a
--- /dev/null
+++ b/net/raw-linux.c
@@ -0,0 +1,97 @@
+/*
+ * QEMU System Emulator
+ *
+ * Copyright (c) 2003-2008 Fabrice Bellard
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies
[PATCH qemu-kvm] Add generic peer_* routines for the remaining tap specific routines
This patch adds generic peer routines for the remaining tap-specific routines (using_vnet_hdr, set_offload). This makes it easier to add new backends like raw (packet sockets) that support gso/checksum-offload. Signed-off-by: Sridhar Samudrala s...@us.ibm.com diff --git a/hw/virtio-net.c b/hw/virtio-net.c index 6e48997..eba578a 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -129,10 +129,13 @@ static int peer_has_vnet_hdr(VirtIONet *n) if (!n->nic->nc.peer) return 0; -if (n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) -return 0; - -n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->nc.peer); +switch (n->nic->nc.peer->info->type) { +case NET_CLIENT_TYPE_TAP: +n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->nc.peer); +break; +default: +return 0; +} return n->has_vnet_hdr; } @@ -142,11 +145,46 @@ static int peer_has_ufo(VirtIONet *n) if (!peer_has_vnet_hdr(n)) return 0; -n->has_ufo = tap_has_ufo(n->nic->nc.peer); +switch (n->nic->nc.peer->info->type) { +case NET_CLIENT_TYPE_TAP: +n->has_ufo = tap_has_ufo(n->nic->nc.peer); +break; +default: +return 0; +} return n->has_ufo; } +static void peer_using_vnet_hdr(VirtIONet *n, int using_vnet_hdr) +{ +if (!n->nic->nc.peer) +return; + +switch (n->nic->nc.peer->info->type) { +case NET_CLIENT_TYPE_TAP: +tap_using_vnet_hdr(n->nic->nc.peer, using_vnet_hdr); +break; +default: +break; +} +} + +static void peer_set_offload(VirtIONet *n, int csum, int tso4, int tso6, + int ecn, int ufo) +{ +if (!n->nic->nc.peer) +return; + +switch (n->nic->nc.peer->info->type) { +case NET_CLIENT_TYPE_TAP: +tap_set_offload(n->nic->nc.peer, csum, tso4, tso6, ecn, ufo); +break; +default: +break; +} +} + static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features) { VirtIONet *n = to_virtio_net(vdev); @@ -154,7 +192,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features) features |= (1 << VIRTIO_NET_F_MAC); if (peer_has_vnet_hdr(n)) { -tap_using_vnet_hdr(n->nic->nc.peer, 1); +peer_using_vnet_hdr(n, 1); } else { features &= ~(0x1 << VIRTIO_NET_F_CSUM); features &= ~(0x1 << VIRTIO_NET_F_HOST_TSO4); @@ -197,7 +235,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features) n->mergeable_rx_bufs = !!(features & (1 << VIRTIO_NET_F_MRG_RXBUF)); if (n->has_vnet_hdr) { -tap_set_offload(n->nic->nc.peer, +peer_set_offload(n, (features >> VIRTIO_NET_F_GUEST_CSUM) & 1, (features >> VIRTIO_NET_F_GUEST_TSO4) & 1, (features >> VIRTIO_NET_F_GUEST_TSO6) & 1, @@ -761,8 +799,8 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id) } if (n->has_vnet_hdr) { -tap_using_vnet_hdr(n->nic->nc.peer, 1); -tap_set_offload(n->nic->nc.peer, +peer_using_vnet_hdr(n, 1); +peer_set_offload(n, (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_CSUM) & 1, (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO4) & 1, (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO6) & 1, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Tue, 2010-01-26 at 14:47 -0600, Anthony Liguori wrote: On 01/26/2010 02:40 PM, Sridhar Samudrala wrote: This patch adds a raw socket backend to qemu and is based on Or Gerlitz's patch re-factored and ported to the latest qemu-kvm git tree. It also includes support for the vnet_hdr option that enables gso/checksum offload with the raw backend. You can find the linux kernel patch to support this feature here. http://thread.gmane.org/gmane.linux.network/150308 Signed-off-by: Sridhar Samudrala s...@us.ibm.com See the previous discussion about the raw backend from Or's original patch. There's no obvious reason why we should have this in addition to a tun/tap backend. The only use-case I know of is macvlan, but macvtap addresses this functionality while not introducing the rather nasty security problems associated with a raw backend. The raw backend can be attached to a physical device, macvlan or SR-IOV VF. I don't think an AF_PACKET socket itself introduces any security problems. The raw socket can be created only by a user with the CAP_NET_RAW capability. The only issue is if we need to assume that qemu itself is an untrusted process and a raw fd cannot be passed to it. But I think it is a useful backend to support in qemu that provides guest-to-remote-host connectivity without the need for a bridge/tap. macvtap could be an alternative if it supports binding to SR-IOV VFs too. Thanks Sridhar
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Tue, 2010-01-26 at 14:50 -0600, Anthony Liguori wrote: On 01/26/2010 02:47 PM, Anthony Liguori wrote: On 01/26/2010 02:40 PM, Sridhar Samudrala wrote: This patch adds a raw socket backend to qemu and is based on Or Gerlitz's patch re-factored and ported to the latest qemu-kvm git tree. It also includes support for the vnet_hdr option that enables gso/checksum offload with the raw backend. You can find the linux kernel patch to support this feature here. http://thread.gmane.org/gmane.linux.network/150308 Signed-off-by: Sridhar Samudrala s...@us.ibm.com See the previous discussion about the raw backend from Or's original patch. There's no obvious reason why we should have this in addition to a tun/tap backend. The only use-case I know of is macvlan, but macvtap addresses this functionality while not introducing the rather nasty security problems associated with a raw backend. Not to mention that from a user perspective, raw makes almost no sense as it's an obscure socket protocol family. Not clear what you mean here. An AF_PACKET socket is just a transport mechanism for the host kernel to put the packets from the guest directly onto an attached interface and vice-versa. A user wants to do useful things like bridged networking or direct VF assignment. We should have -net backends that reflect things that make sense to a user. Binding to an SR-IOV VF is one of the use-cases supported by the raw backend. Thanks Sridhar
virtio-net offloads not enabled with latest qemu-kvm
With the latest upstream qemu-kvm git tree, all the offloads are disabled on virtio-net. peer_has_vnet_hdr(n) in virtio_net_get_features() is failing because n->vc->peer is NULL. Could not figure out yet why the peer field is not initialized. Do I need any new options to be specified with the qemu command? qemu-system-x86_64 -m 1024 -kernel /boot/vmlinuz-2.6.32-guest -append 'root=/dev/vda1 console=tty0 console=ttyS0,115200' -initrd /boot/initrd-2.6.32-guest.img -drive file=/kvm_images/fedora10-2-vm,if=virtio,index=0 -net nic,macaddr=54:52:00:35:e3:74,model=virtio -net tap,ifname=vnet1,script=no,downscript=no Works fine with qemu-kvm-0.11.0 and I see checksum/tso/gso are enabled on the guest virtio-net device. Thanks Sridhar
Re: vhost-net patches
On Thu, 2009-10-22 at 19:43 +0200, Michael S. Tsirkin wrote: Possibly we'll have to debug this in vhost in the host kernel. I would debug this directly, it's just that my setup is somehow different and I do not see this issue, otherwise I would not waste your time. Can we add some printks? handle_tx has this at the top: if (!sock || !sock_writeable(sock->sk)) return; I added some debug printks in handle_rx and handle_tx; get_user() calls are failing with EFAULT. Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: handle_rx: net:f5494800 Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: Failed to enable notification: -14 Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx at 0002 ret:-14 Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: Failed to disable notification: -14 Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx at 0002 ret:-14 Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800 Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx at 0002 ret:-14 Oct 21 10:51:48 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800 Oct 21 10:51:48 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx at 0002 ret:-14 Oct 21 10:51:49 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800 Oct 21 10:51:49 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx at 0002 ret:-14 Oct 21 10:51:53 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800 Oct 21 10:51:53 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx at 0002 ret:-14 Oct 21 10:52:03 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800 Oct 21 10:52:03 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx at 0002 ret:-14 Oct 21 10:52:22 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800 Oct 21 10:52:22 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx at 0002 ret:-14 Oct 21 10:52:23 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800 Oct 21 10:52:23 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx at 0002 ret:-14 Oct 21 10:53:17 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800 Oct 21 10:53:17 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx at 0002 ret:-14 Oct 21 10:56:56 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800 Oct 21 10:56:56 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx at 0002 ret:-14 Thanks Sridhar Could you please add printk(KERN_ERR "%s: !sock = %d, !sock || !sock_writeable(sock->sk) = %d\n", __func__, !sock, !sock || !sock_writeable(sock->sk)); *before* these checks? Then make modules modules_install, rmmod vhost_net, insmod vhost_net and re-run. If you want me to send a patch that does this, let me know please. Thanks!
Re: vhost-net patches
On Sun, 2009-10-18 at 19:32 +0200, Michael S. Tsirkin wrote: On Sun, Oct 18, 2009 at 12:53:56PM +0200, Michael S. Tsirkin wrote: On Fri, Oct 16, 2009 at 12:29:29PM -0700, Sridhar Samudrala wrote: Hi Michael, We are trying out your vhost-net patches from your git trees on kernel.org. I am using mst/vhost.git as host kernel and mst/qemu-kvm.git for qemu. I am using the following qemu script to start the guest using userspace tap backend. home/sridhar/git/mst/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 /home/sridhar/kvm_images/fedora10-1-vm -m 512 -drive file=/home/sridhar/kvm_images/fedora10-1-vm,if=virtio,index=0,boot=on -net nic,macaddr=54:52:00:35:e3:73,model=virtio -net tap,ifname=vnet0,script=no,downscript=no Now that i got the default backend to work, i wanted to try vhost in kernel. But could not figure out the right -net option to pass to qemu. Can you let me know the right syntax to start a guest using vhost. Thanks Sridhar Here's an example with raw socket: /root/kvm-test/bin/qemu-system-x86_64 -m 1G -kernel \ /boot/vmlinuz-$release -append \ 'root=UUID=d5d2d201-d086-42ad-bb1d-32fbe40eda71 ro quiet nosplash \ console=tty0 console=ttyS0,9600n8' -initrd /boot/guest-initrd.img \ $HOME/disk.raw.copy -net raw,ifname=eth3 -net nic,model=virtio,vhost \ -balloon none -redir tcp:8023::22 As you see, I changed the command line. You now simply add ,vhost after model, and it will locate host network interface specified earlier and attach to it. This should have been clear from running qemu with -help flag. Could you please suggest how can that text be clarified? I updated to your latest git trees and the default user-space tap backend using the following -net options worked fine. -net tap,ifname=vnet0,script=no,downscript=no -net nic,model=virtio But i could not get vhost to work with either raw or tap backends. I tried the following combinations. 
1) -net raw,ifname=eth0 -net nic,model=virtio,vhost 2) -net raw,ifname=vnet0, -net nic,model=virtio,vhost 3) -net tap,ifname=vnet0,script=no,downscript=no -net nic,model=virtio,vhost They all failed with the following error: vhost_net_init returned -7 This is an error message from hw/virtio-net.c:virtio_net_driver_ok() when vhost_net_start() fails. It looks like dev->binding->irqfd() is failing in vhost_virtqueue_init(). Haven't yet debugged further. I have CONFIG_EVENTFD enabled in the host kernel. Are all the above -net options supposed to work? In your descriptions, you say that checksum/tso offload is not supported. Isn't it possible to send/receive large packets without checksum using AF_PACKET sockets if the attached interface supports these offloads? Do you see the same offload issue even when using the tap backend via vhost? Thanks Sridhar
Re: [Qemu-devel] Re: Release plan for 0.12.0
On Wed, 2009-10-14 at 17:50 +0200, Michael S. Tsirkin wrote: On Wed, Oct 14, 2009 at 04:19:17PM +0100, Jamie Lokier wrote: Michael S. Tsirkin wrote: On Wed, Oct 14, 2009 at 09:17:15AM -0500, Anthony Liguori wrote: Michael S. Tsirkin wrote: Looks like Or has abandoned it. I have an updated version which works with new APIs, etc. Let me post it and we'll go from there. I'm generally inclined to oppose the functionality as I don't think it offers any advantages over the existing backends. I patch it in and use it all the time. It's much easier to setup on a random machine than a bridged config. Having two things that do the same thing is just going to lead to user confusion. They do not do the same thing. With raw socket you can use windows update without a bridge in the host, with tap you can't. On the other hand, with raw socket, guest Windows can't access files on the host's Samba share can it? So it's not that useful even for Windows guests. I guess this depends on whether you use the same host for samba :) If the problem is tap is too hard to setup, we should try to simplify tap configuration. The problem is bridge is too hard to setup. Simplifying that is a good idea, but outside the scope of the qemu project. I venture it's important enough for qemu that it's worth working on that. Something that looks like the raw socket but behaves like an automatically instantiated bridge attached to the bound interface would be a useful interface. I agree, that would be good to have. Can't we bind the raw socket to the tap interface instead of the physical interface and allow the bridge config to work. Thanks Sridhar I don't have much time, but I'll help anybody who wants to do that. 
-- Jamie
Re: [PATCH qemu-kvm] Enable UFO on virtio-net and tap devices
On Thu, 2009-10-08 at 11:07 +0100, Mark McLoughlin wrote: On Wed, 2009-10-07 at 14:50 -0700, Sridhar Samudrala wrote: linux 2.6.32 includes UDP fragmentation offload support in software. So we can enable UFO on the host tap device if supported and allow setting UFO on virtio-net in the guest. Hmm, we really need to detect whether the host has tuntap UFO support before advertising it to the guest. Maybe in net_tap_fd_init() we should toggle TUN_F_UFO back and forth and check for EINVAL? Sure. Here is an updated patch that checks for UFO support in the host. Thanks Sridhar Signed-off-by: Sridhar Samudrala s...@us.ibm.com diff --git a/hw/virtio-net.c b/hw/virtio-net.c index ce8e6cb..9561f34 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -150,7 +150,10 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev) features |= (1 << VIRTIO_NET_F_HOST_TSO6); features |= (1 << VIRTIO_NET_F_HOST_ECN); features |= (1 << VIRTIO_NET_F_MRG_RXBUF); -/* Kernel can't actually handle UFO in software currently. */ +if (tap_has_ufo(host)) { +features |= (1 << VIRTIO_NET_F_GUEST_UFO); +features |= (1 << VIRTIO_NET_F_HOST_UFO); +} } #endif @@ -189,7 +192,8 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features) (features >> VIRTIO_NET_F_GUEST_CSUM) & 1, (features >> VIRTIO_NET_F_GUEST_TSO4) & 1, (features >> VIRTIO_NET_F_GUEST_TSO6) & 1, - (features >> VIRTIO_NET_F_GUEST_ECN) & 1); + (features >> VIRTIO_NET_F_GUEST_ECN) & 1, + (features >> VIRTIO_NET_F_GUEST_UFO) & 1); #endif } diff --git a/net.c b/net.c index 8032ff8..838e42c 100644 --- a/net.c +++ b/net.c @@ -1271,6 +1271,11 @@ void tap_using_vnet_hdr(void *opaque, int using_vnet_hdr) { } +int tap_has_ufo(void *opaque) +{ +return 0; +} + #else /* !defined(_WIN32) */ /* Maximum GSO packet size (64k) plus plenty of room for @@ -1292,6 +1297,7 @@ typedef struct TAPState { unsigned int write_poll : 1; unsigned int has_vnet_hdr : 1; unsigned int using_vnet_hdr : 1; +unsigned int has_ufo : 1; } TAPState; static int launch_script(const char *setup_script, const char *ifname, int fd); @@ -1527,9 +1533,22 @@ static int tap_probe_vnet_hdr(int fd) #endif } +int tap_has_ufo(void *opaque) +{ +VLANClientState *vc = opaque; +TAPState *s = vc->opaque; + +return s ? s->has_ufo : 0; +} + #ifdef TUNSETOFFLOAD + +#ifndef TUN_F_UFO +#define TUN_F_UFO 0x10 +#endif + static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6, - int ecn) + int ecn, int ufo) { TAPState *s = vc->opaque; unsigned int offload = 0; @@ -1542,11 +1561,18 @@ static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6, offload |= TUN_F_TSO6; if ((tso4 || tso6) && ecn) offload |= TUN_F_TSO_ECN; + if (ufo) + offload |= TUN_F_UFO; } -if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) - fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n", - strerror(errno)); +if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) { +/* Try without UFO */ +offload &= ~TUN_F_UFO; +if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) { + fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n", + strerror(errno)); +} +} } #endif /* TUNSETOFFLOAD */ @@ -1574,6 +1600,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int vnet_hdr) { TAPState *s; +unsigned int offload; s = qemu_mallocz(sizeof(TAPState)); s->fd = fd; @@ -1583,7 +1610,14 @@ s->vc->receive_raw = tap_receive_raw; #ifdef TUNSETOFFLOAD s->vc->set_offload = tap_set_offload; -tap_set_offload(s->vc, 0, 0, 0, 0); + +s->has_ufo = 0; +/* Check if tap supports UFO */ +offload = TUN_F_CSUM | TUN_F_UFO; +if (ioctl(s->fd, TUNSETOFFLOAD, offload) == 0) + s->has_ufo = 1; + +tap_set_offload(s->vc, 0, 0, 0, 0, 0); #endif tap_read_poll(s, 1); snprintf(s->vc->info_str, sizeof(s->vc->info_str), "fd=%d", fd); diff --git a/net.h b/net.h index 925c67c..6bb6434 100644 --- a/net.h +++ b/net.h @@ -14,7 +14,7 @@ typedef ssize_t (NetReceive)(VLANClientState *, const uint8_t *, size_t); typedef ssize_t (NetReceiveIOV)(VLANClientState *, const struct iovec *, int); typedef void (NetCleanup) (VLANClientState *); typedef void (LinkStatusChanged)(VLANClientState *); -typedef void (SetOffload)(VLANClientState *, int, int, int, int); +typedef void (SetOffload)(VLANClientState *, int, int, int, int, int); struct VLANClientState { NetReceive *receive; @@ -92,6 +92,7 @@ void do_info_usernet(Monitor *mon); int tap_has_vnet_hdr(void *opaque); void tap_using_vnet_hdr
[PATCH qemu-kvm] Enable UFO on virtio-net and tap devices
linux 2.6.32 includes UDP fragmentation offload support in software. So we can enable UFO on the host tap device if supported and allow setting UFO on virtio-net in the guest. This improves UDP stream performance significantly between guest and host and inter-guest. TUN_F_UFO is a new #define added to the 2.6.32 kernel header file include/linux/if_tun.h. Until this updated header file gets into distro releases, I think we need to have this defined in qemu. Signed-off-by: Sridhar Samudrala s...@us.ibm.com diff --git a/hw/virtio-net.c b/hw/virtio-net.c index ce8e6cb..c73487d 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -150,7 +150,8 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev) features |= (1 << VIRTIO_NET_F_HOST_TSO6); features |= (1 << VIRTIO_NET_F_HOST_ECN); features |= (1 << VIRTIO_NET_F_MRG_RXBUF); -/* Kernel can't actually handle UFO in software currently. */ +features |= (1 << VIRTIO_NET_F_GUEST_UFO); +features |= (1 << VIRTIO_NET_F_HOST_UFO); } #endif @@ -189,7 +190,8 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features) (features >> VIRTIO_NET_F_GUEST_CSUM) & 1, (features >> VIRTIO_NET_F_GUEST_TSO4) & 1, (features >> VIRTIO_NET_F_GUEST_TSO6) & 1, - (features >> VIRTIO_NET_F_GUEST_ECN) & 1); + (features >> VIRTIO_NET_F_GUEST_ECN) & 1, + (features >> VIRTIO_NET_F_GUEST_UFO) & 1); #endif } diff --git a/net.c b/net.c index 8032ff8..1942e25 100644 --- a/net.c +++ b/net.c @@ -1528,8 +1528,13 @@ static int tap_probe_vnet_hdr(int fd) } #ifdef TUNSETOFFLOAD + +#ifndef TUN_F_UFO +#define TUN_F_UFO 0x10 +#endif + static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6, - int ecn) + int ecn, int ufo) { TAPState *s = vc->opaque; unsigned int offload = 0; @@ -1542,11 +1547,18 @@ static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6, offload |= TUN_F_TSO6; if ((tso4 || tso6) && ecn) offload |= TUN_F_TSO_ECN; + if (ufo) + offload |= TUN_F_UFO; } -if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) - fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n", - strerror(errno)); +if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) { +/* Try without UFO */ +offload &= ~TUN_F_UFO; +if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) { + fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n", + strerror(errno)); +} +} } #endif /* TUNSETOFFLOAD */ @@ -1583,7 +1595,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, s->vc->receive_raw = tap_receive_raw; #ifdef TUNSETOFFLOAD s->vc->set_offload = tap_set_offload; -tap_set_offload(s->vc, 0, 0, 0, 0); +tap_set_offload(s->vc, 0, 0, 0, 0, 0); #endif tap_read_poll(s, 1); snprintf(s->vc->info_str, sizeof(s->vc->info_str), "fd=%d", fd); diff --git a/net.h b/net.h index 925c67c..ac3701c 100644 --- a/net.h +++ b/net.h @@ -14,7 +14,7 @@ typedef ssize_t (NetReceive)(VLANClientState *, const uint8_t *, size_t); typedef ssize_t (NetReceiveIOV)(VLANClientState *, const struct iovec *, int); typedef void (NetCleanup) (VLANClientState *); typedef void (LinkStatusChanged)(VLANClientState *); -typedef void (SetOffload)(VLANClientState *, int, int, int, int); +typedef void (SetOffload)(VLANClientState *, int, int, int, int, int); struct VLANClientState { NetReceive *receive;
Re: [PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources
Rusty Russell wrote: On Mon, 8 Jun 2009 02:46:08 pm Herbert Xu wrote: On Fri, Jun 05, 2009 at 05:16:31PM -0700, Sridhar Samudrala wrote: + /* Software UFO is not yet supported */ + segs = ERR_PTR(-EPROTONOSUPPORT); Hmm, we need to fill this in before you start using it for virt. After all, it's very difficult for the guest to know whether the output device on the host is going to be able to do UFO or not. Yep, no point telling the guest we support UFO if we don't! This path is taken only when the packet needs to go out a device that doesn't support UFO. Until the software UFO is supported, I thought it may be useful to provide a way to enable UFO inter-guest, between host to guest and to an external host via an outgoing device that supports UFO. Thanks Sridhar -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources
Herbert Xu wrote: On Mon, Jun 08, 2009 at 10:04:47AM -0700, Sridhar Samudrala wrote: OK. Can we use skb_segment() to do IP fragmentation of UDP packets? It should be able to. Unfortunately, this doesn't work for UDP without any changes. skb_segment() currently adds transport header to each segmented skb. But when we are using IP fragmentation, only the first fragment should include the UDP header. We either need to fix skb_segment() to handle IP fragmentation or write a new skb_fragment(). I will look into this when i get some time. Thanks Sridhar The function itself looks protocol independent and if it is simply splitting large skb into a list of mtu sized skb's, can we fixup the ip header differently for UDP packets(id and frag_off) in inet_gso_segment and ipv6_gso_segment? Yes that's the idea. Cheers, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH qemu-kvm] Enable UFO on virtio-net/tap devices
On Sun, 2009-06-07 at 09:21 +0300, Avi Kivity wrote: Sridhar Samudrala wrote: Enable UFO on the host tap device if supported and allow setting UFO on virtio-net in the guest. Signed-off-by: Sridhar Samudrala s...@us.ibm.com diff --git a/hw/virtio-net.c b/hw/virtio-net.c index 3c77b99..8a53e27 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -134,7 +134,8 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev) features |= (1 << VIRTIO_NET_F_HOST_TSO6); features |= (1 << VIRTIO_NET_F_HOST_ECN); features |= (1 << VIRTIO_NET_F_MRG_RXBUF); -/* Kernel can't actually handle UFO in software currently. */ + // features |= (1 << VIRTIO_NET_F_HOST_UFO); +features |= (1 << VIRTIO_NET_F_GUEST_UFO); Where are these defined? They are in hw/virtio_net.h @@ -990,8 +990,13 @@ static int tap_probe_vnet_hdr(int fd) } #ifdef TUNSETOFFLOAD + +#ifndef TUN_F_UFO +#define TUN_F_UFO 0x10 +#endif + Should just use the definition in the kernel header. TUN_F_UFO is not defined in the current version of the kernel header linux/if_tun.h. I have added this in my kernel patches to support UFO. So until this definition gets into the distro versions of kernel headers, I think we need to have this defined in qemu. Thanks Sridhar
Re: [PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources
On Mon, 2009-06-08 at 15:16 +1000, Herbert Xu wrote: On Fri, Jun 05, 2009 at 05:16:31PM -0700, Sridhar Samudrala wrote: + /* Software UFO is not yet supported */ + segs = ERR_PTR(-EPROTONOSUPPORT); Hmm, we need to fill this in before you start using it for virt. After all, it's very difficult for the guest to know whether the output device on the host is going to be able to do UFO or not. OK. Can we use skb_segment() to do IP fragmentation of UDP packets? Or does it only support TCP segmentation? The function itself looks protocol independent and if it is simply splitting a large skb into a list of mtu sized skb's, can we fixup the ip header differently for UDP packets (id and frag_off) in inet_gso_segment and ipv6_gso_segment? Thanks Sridhar
[PATCH net-next 1/4] IPv6: HW csum support for outgoing large UFO packets
Add HW checksum support for outgoing large UDP/IPv6 packets destined to a UFO enabled device. Signed-off-by: Sridhar Samudrala s...@us.ibm.com --- net/ipv6/udp.c | 51 ++- 1 files changed, 50 insertions(+), 1 deletions(-) diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index fc333d8..fad0f5f 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -638,6 +638,47 @@ static void udp_v6_flush_pending_frames(struct sock *sk) } } +/** + * udp6_hwcsum_outgoing - handle outgoing HW checksumming + * @sk: socket we are sending on + * @skb: sk_buff containing the filled-in UDP header + * (checksum field must be zeroed out) + */ +static void udp6_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb, +const struct in6_addr *saddr, +const struct in6_addr *daddr, int len) +{ + unsigned int offset; + struct udphdr *uh = udp_hdr(skb); + __wsum csum = 0; + + if (skb_queue_len(&sk->sk_write_queue) == 1) { + /* Only one fragment on the socket. */ + skb->csum_start = skb_transport_header(skb) - skb->head; + skb->csum_offset = offsetof(struct udphdr, check); + uh->check = csum_ipv6_magic(saddr, daddr, len, IPPROTO_UDP, 0); + } else { + /* +* HW-checksum won't work as there are two or more +* fragments on the socket so that all csums of sk_buffs +* should be together +*/ + offset = skb_transport_offset(skb); + skb->csum = skb_checksum(skb, offset, skb->len - offset, 0); + + skb->ip_summed = CHECKSUM_NONE; + + skb_queue_walk(&sk->sk_write_queue, skb) { + csum = csum_add(csum, skb->csum); + } + + uh->check = csum_ipv6_magic(saddr, daddr, len, IPPROTO_UDP, + csum); + if (uh->check == 0) + uh->check = CSUM_MANGLED_0; + } +} + /* * Sending */ @@ -668,7 +709,14 @@ static int udp_v6_push_pending_frames(struct sock *sk) if (is_udplite) csum = udplite_csum_outgoing(sk, skb); -else +else if (sk->sk_no_check == UDP_CSUM_NOXMIT) { /* UDP csum disabled */ + skb->ip_summed = CHECKSUM_NONE; + goto send; + } else if (skb->ip_summed == CHECKSUM_PARTIAL) { /* UDP hardware csum */ + udp6_hwcsum_outgoing(sk, skb, &fl->fl6_src, &fl->fl6_dst, +up->len); + goto send; + } else csum = udp_csum_outgoing(sk, skb); /* add protocol-dependent pseudo-header */ @@ -677,6 +725,7 @@ static int udp_v6_push_pending_frames(struct sock *sk) if (uh->check == 0) uh->check = CSUM_MANGLED_0; +send: err = ip6_push_pending_frames(sk); out: up->len = 0;
[PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources
Setup partial checksum and add gso checks to handle large UFO packets from untrusted sources. Signed-off-by: Sridhar Samudrala s...@us.ibm.com --- include/net/udp.h | 3 +++ net/ipv4/af_inet.c | 2 ++ net/ipv4/udp.c | 60 net/ipv6/udp.c | 22 +++ 4 files changed, 87 insertions(+), 0 deletions(-) diff --git a/include/net/udp.h b/include/net/udp.h index 90e6ce5..274b764 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -207,4 +207,7 @@ extern void udp4_proc_exit(void); #endif extern void udp_init(void); + +extern int udp_v4_gso_send_check(struct sk_buff *skb); +extern struct sk_buff *udp_ufo_fragment(struct sk_buff *skb, int features); #endif /* _UDP_H */ diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index d873621..1c1ac7d 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1426,6 +1426,8 @@ static struct net_protocol tcp_protocol = { static struct net_protocol udp_protocol = { .handler = udp_rcv, .err_handler = udp_err, + .gso_send_check = udp_v4_gso_send_check, + .gso_segment = udp_ufo_fragment, .no_policy = 1, .netns_ok = 1, }; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 8f4158d..678bdf9 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1815,6 +1815,66 @@ void __init udp_init(void) sysctl_udp_wmem_min = SK_MEM_QUANTUM; } +int udp_v4_gso_send_check(struct sk_buff *skb) +{ + const struct iphdr *iph; + struct udphdr *uh; + + if (!pskb_may_pull(skb, sizeof(*uh))) + return -EINVAL; + + iph = ip_hdr(skb); + uh = udp_hdr(skb); + + uh->check = 0; + uh->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, skb->len, IPPROTO_UDP, 0); + skb->csum_start = skb_transport_header(skb) - skb->head; + skb->csum_offset = offsetof(struct udphdr, check); + skb->ip_summed = CHECKSUM_PARTIAL; + return 0; +} + +struct sk_buff *udp_ufo_fragment(struct sk_buff *skb, int features) +{ + struct sk_buff *segs = ERR_PTR(-EINVAL); + struct udphdr *uh; + unsigned uhlen; + unsigned int mss; + + if (!pskb_may_pull(skb, sizeof(*uh))) + goto out; + + uh = udp_hdr(skb); + uhlen = sizeof(*uh); + + __skb_pull(skb, uhlen); + + mss = skb_shinfo(skb)->gso_size; + if (unlikely(skb->len <= mss)) + goto out; + + if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) { + int type = skb_shinfo(skb)->gso_type; + + if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY) || +!(type & (SKB_GSO_UDP)))) + goto out; + + skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss); + + segs = NULL; + goto out; + } + + /* Software UFO is not yet supported */ + segs = ERR_PTR(-EPROTONOSUPPORT); + +out: + return segs; +} +EXPORT_SYMBOL(udp_ufo_fragment); + + EXPORT_SYMBOL(udp_disconnect); EXPORT_SYMBOL(udp_ioctl); EXPORT_SYMBOL(udp_prot); diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index fad0f5f..ccd42d5 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1081,9 +1081,31 @@ int compat_udpv6_getsockopt(struct sock *sk, int level, int optname, } #endif + +static int udpv6_gso_send_check(struct sk_buff *skb) +{ + struct ipv6hdr *ipv6h; + struct udphdr *uh; + + if (!pskb_may_pull(skb, sizeof(*uh))) + return -EINVAL; + + ipv6h = ipv6_hdr(skb); + uh = udp_hdr(skb); + + uh->check = 0; + uh->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr, skb->len, IPPROTO_UDP, 0); + skb->csum_start = skb_transport_header(skb) - skb->head; + skb->csum_offset = offsetof(struct udphdr, check); + skb->ip_summed = CHECKSUM_PARTIAL; + return 0; +} + static struct inet6_protocol udpv6_protocol = { .handler = udpv6_rcv, .err_handler = udpv6_err, + .gso_send_check = udpv6_gso_send_check, + .gso_segment = udp_ufo_fragment, .flags = INET6_PROTO_NOPOLICY|INET6_PROTO_FINAL, };
[PATCH net-next 3/4] virtio_net: Allow UFO feature to be set on virtio_net device.
Allow UFO feature to be set on virtio_net device.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com
---
 drivers/net/virtio_net.c | 3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0c9ca67..d24ede0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -773,6 +773,7 @@ static struct ethtool_ops virtnet_ethtool_ops = {
 	.set_tx_csum = virtnet_set_tx_csum,
 	.set_sg = ethtool_op_set_sg,
 	.set_tso = ethtool_op_set_tso,
+	.set_ufo = ethtool_op_set_ufo,
 	.get_link = ethtool_op_get_link,
 };

@@ -1013,7 +1014,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GSO, VIRTIO_NET_F_MAC,
 	VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6,
 	VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
-	VIRTIO_NET_F_GUEST_ECN, /* We don't yet handle UFO input. */
+	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
 	VIRTIO_F_NOTIFY_ON_EMPTY,
[PATCH net-next 4/4] tun: Allow tap device to send/receive UFO packets.
Handle send/receive of UFO packets in tun/tap driver.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com
---
 drivers/net/tun.c      | 13 ++++++++++++-
 include/linux/if_tun.h |  1 +
 2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 15cfcb2..96d23b8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -635,6 +635,9 @@ static __inline__ ssize_t tun_get_user(struct tun_struct *tun,
 		case VIRTIO_NET_HDR_GSO_TCPV6:
 			skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
 			break;
+		case VIRTIO_NET_HDR_GSO_UDP:
+			skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
+			break;
 		default:
 			tun->dev->stats.rx_frame_errors++;
 			kfree_skb(skb);
@@ -720,6 +723,8 @@ static __inline__ ssize_t tun_put_user(struct tun_struct *tun,
 			gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
 		else if (sinfo->gso_type & SKB_GSO_TCPV6)
 			gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+		else if (sinfo->gso_type & SKB_GSO_UDP)
+			gso.gso_type = VIRTIO_NET_HDR_GSO_UDP;
 		else
 			BUG();
 		if (sinfo->gso_type & SKB_GSO_TCP_ECN)
@@ -1068,7 +1073,8 @@ static int set_offload(struct net_device *dev, unsigned long arg)
 	old_features = dev->features;
 	/* Unset features, set them as we chew on the arg. */
 	features = (old_features & ~(NETIF_F_HW_CSUM|NETIF_F_SG|NETIF_F_FRAGLIST
-				    |NETIF_F_TSO_ECN|NETIF_F_TSO|NETIF_F_TSO6));
+				    |NETIF_F_TSO_ECN|NETIF_F_TSO|NETIF_F_TSO6
+				    |NETIF_F_UFO));

 	if (arg & TUN_F_CSUM) {
 		features |= NETIF_F_HW_CSUM|NETIF_F_SG|NETIF_F_FRAGLIST;
@@ -1085,6 +1091,11 @@ static int set_offload(struct net_device *dev, unsigned long arg)
 				features |= NETIF_F_TSO6;
 			arg &= ~(TUN_F_TSO4|TUN_F_TSO6);
 		}
+
+		if (arg & TUN_F_UFO) {
+			features |= NETIF_F_UFO;
+			arg &= ~TUN_F_UFO;
+		}
 	}

 	/* This gives the user a way to test for new features in future by
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 915ba57..3f5fd52 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -62,6 +62,7 @@
 #define TUN_F_TSO4	0x02	/* I can handle TSO for IPv4 packets */
 #define TUN_F_TSO6	0x04	/* I can handle TSO for IPv6 packets */
 #define TUN_F_TSO_ECN	0x08	/* I can handle TSO with ECN bits. */
+#define TUN_F_UFO	0x10	/* I can handle UFO packets */

 /* Protocol info prepended to the packets (when IFF_NO_PI is not set) */
 #define TUN_PKT_STRIP	0x0001
[PATCH qemu-kvm] Enable UFO on virtio-net/tap devices
Enable UFO on the host tap device if supported and allow setting UFO on virtio-net in the guest.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 3c77b99..8a53e27 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -134,7 +134,8 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
         features |= (1 << VIRTIO_NET_F_HOST_TSO6);
         features |= (1 << VIRTIO_NET_F_HOST_ECN);
         features |= (1 << VIRTIO_NET_F_MRG_RXBUF);
-        /* Kernel can't actually handle UFO in software currently. */
+        // features |= (1 << VIRTIO_NET_F_HOST_UFO);
+        features |= (1 << VIRTIO_NET_F_GUEST_UFO);
     }
 #endif
@@ -173,6 +174,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
                     (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
                     (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
                     (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
+                    (features >> VIRTIO_NET_F_GUEST_UFO) & 1,
                     (features >> VIRTIO_NET_F_GUEST_ECN) & 1);
 #endif
 }
diff --git a/net.c b/net.c
index 01e31db..e7dbcd0 100644
--- a/net.c
+++ b/net.c
@@ -990,8 +990,13 @@ static int tap_probe_vnet_hdr(int fd)
 }

 #ifdef TUNSETOFFLOAD
+
+#ifndef TUN_F_UFO
+#define TUN_F_UFO	0x10
+#endif
+
 static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6,
-                            int ecn)
+                            int ecn, int ufo)
 {
     TAPState *s = vc->opaque;
     unsigned int offload = 0;
@@ -1004,11 +1009,18 @@ static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6,
             offload |= TUN_F_TSO6;
         if ((tso4 || tso6) && ecn)
             offload |= TUN_F_TSO_ECN;
+        if (ufo)
+            offload |= TUN_F_UFO;
     }

-    if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0)
-        fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n",
-                strerror(errno));
+    if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) {
+        /* Try without UFO */
+        offload &= ~TUN_F_UFO;
+        if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) {
+            fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n",
+                    strerror(errno));
+        }
+    }
 }
 #endif /* TUNSETOFFLOAD */
@@ -1043,7 +1055,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan,
     s->vc->fd_read_raw = tap_receive_raw;
 #ifdef TUNSETOFFLOAD
     s->vc->set_offload = tap_set_offload;
-    tap_set_offload(s->vc, 0, 0, 0, 0);
+    tap_set_offload(s->vc, 0, 0, 0, 0, 0);
 #endif
     qemu_set_fd_handler2(s->fd, tap_can_send, tap_send, NULL, s);
     snprintf(s->vc->info_str, sizeof(s->vc->info_str), "fd=%d", fd);
diff --git a/net.h b/net.h
index 3d0b6f2..ecfb1f9 100644
--- a/net.h
+++ b/net.h
@@ -11,7 +11,7 @@ typedef struct VLANClientState VLANClientState;
 typedef void (NetCleanup) (VLANClientState *);
 typedef void (LinkStatusChanged)(VLANClientState *);
-typedef void (SetOffload)(VLANClientState *, int, int, int, int);
+typedef void (SetOffload)(VLANClientState *, int, int, int, int, int);

 struct VLANClientState {
     IOReadHandler *fd_read;
virtio-net not working with the latest qemu-kvm git
When I moved to the latest qemu-kvm git tree from kvm-85, I noticed that networking stopped working between the host and the guest. It started working once I put the device in promiscuous mode by running tcpdump in the background on the guest. After browsing through the recent patches, I found that the following commit is causing the regression:

    Remove stray GSO code from virtio_net (Mark McLoughlin)
    http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=commitdiff;h=559a8f45f34cc50d1a60b4f67a06614d506b2e01

The commit comment doesn't seem to match the code that is removed by this patch.

Thanks
Sridhar