Re: [PATCH v5 REPOST 0/6] fix hw_random stuck
On Mon, Dec 08, 2014 at 04:50:34PM +0800, Amos Kong wrote: When I hotunplug a busy virtio-rng device or try to access hwrng attributes in non-smp guest, it gets stuck. My hotplug tests: | test 0: | hotunplug rng device from qemu monitor | | test 1: | guest) # dd if=/dev/hwrng of=/dev/null | hotunplug rng device from qemu monitor | | test 2: | guest) # dd if=/dev/random of=/dev/null | hotunplug rng device from qemu monitor | | test 4: | guest) # dd if=/dev/hwrng of=/dev/null | cat /sys/devices/virtual/misc/hw_random/rng_* | | test 5: | guest) # dd if=/dev/hwrng of=/dev/null | cancel dd process after 10 seconds | guest) # dd if=/dev/hwrng of=/dev/null | hotunplug rng device from qemu monitor | | test 6: | use a fifo as rng backend, execute test 0 ~ 5 with no input of fifo All applied. Thanks a lot! -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 0/6] fix hw_random stuck
On Sat, Dec 06, 2014 at 12:16:36PM +0800, Amos Kong wrote: When I hotunplug a busy virtio-rng device or try to access hwrng attributes in non-smp guest, it gets stuck. Please resend these via the linux-crypto mailing list so they can be picked up by patchwork. Thanks, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] hw_random: fix unregister race.
On Mon, Nov 10, 2014 at 09:47:27PM +0800, Herbert Xu wrote: On Mon, Nov 03, 2014 at 11:56:24PM +0800, Amos Kong wrote: @@ -98,6 +99,8 @@ static inline void cleanup_rng(struct kref *kref) if (rng-cleanup) rng-cleanup(rng); You need a compiler barrier here to prevent reordering. Michael Büsch pointed out that we should actually have a memory barrier here. I thought we didn't need it because I was only worried about the code in my original complaint. However, expecting driver writers to use correct synchronisation primitives is surely asking too much. So let's add an smp_wmb() here to ensure all side-effects of cleanup is visible. Thanks, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] hw_random: fix unregister race.
On Mon, Nov 03, 2014 at 11:56:24PM +0800, Amos Kong wrote: @@ -98,6 +99,8 @@ static inline void cleanup_rng(struct kref *kref) if (rng-cleanup) rng-cleanup(rng); You need a compiler barrier here to prevent reordering. + rng-cleanup_done = true; Thanks, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 4/6] hw_random: fix unregister race.
On Sun, Nov 02, 2014 at 11:06:13PM +0800, Amos Kong wrote: On Fri, Oct 31, 2014 at 03:23:21PM +0800, Herbert Xu wrote: On Fri, Oct 31, 2014 at 10:28:00AM +1030, Rusty Russell wrote: Herbert Xu herb...@gondor.apana.org.au writes: On Thu, Sep 18, 2014 at 08:37:45PM +0800, Amos Kong wrote: From: Rusty Russell ru...@rustcorp.com.au The previous patch added one potential problem: we can still be reading from a hwrng when it's unregistered. Add a wait for zero in the hwrng_unregister path. Signed-off-by: Rusty Russell ru...@rustcorp.com.au You totally corrupted Rusty's patch. If you're going to repost his series you better make sure that you've got the right patches. Just as well though as it made me think a little more about this patch :) OK Amos, can you please repost the complete series? Please fix the unregister race I identified first. Ok, I redownload the patches from https://patchwork.kernel.org/project/LKML/list/?submitter=5 and rebase fixes of me and rusty on it. I will post a V4 later. Thanks. Please fix the unregister race I pointed out before doing a v4. Thanks, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 4/6] hw_random: fix unregister race.
On Fri, Oct 31, 2014 at 10:28:00AM +1030, Rusty Russell wrote: Herbert Xu herb...@gondor.apana.org.au writes: On Thu, Sep 18, 2014 at 08:37:45PM +0800, Amos Kong wrote: From: Rusty Russell ru...@rustcorp.com.au The previous patch added one potential problem: we can still be reading from a hwrng when it's unregistered. Add a wait for zero in the hwrng_unregister path. Signed-off-by: Rusty Russell ru...@rustcorp.com.au You totally corrupted Rusty's patch. If you're going to repost his series you better make sure that you've got the right patches. Just as well though as it made me think a little more about this patch :) OK Amos, can you please repost the complete series? Please fix the unregister race I identified first. Thanks, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/5] hw_random: fix unregister race.
On Thu, Sep 18, 2014 at 12:18:24PM +0930, Rusty Russell wrote: The previous patch added one potential problem: we can still be reading from a hwrng when it's unregistered. Add a wait for zero in the hwrng_unregister path. Signed-off-by: Rusty Russell ru...@rustcorp.com.au --- drivers/char/hw_random/core.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c index dc9092a1075d..b4a21e9521cf 100644 --- a/drivers/char/hw_random/core.c +++ b/drivers/char/hw_random/core.c @@ -60,6 +60,7 @@ static DEFINE_MUTEX(rng_mutex); static DEFINE_MUTEX(reading_mutex); static int data_avail; static u8 *rng_buffer, *rng_fillbuf; +static DECLARE_WAIT_QUEUE_HEAD(rng_done); static unsigned short current_quality; static unsigned short default_quality; /* = 0; default to off */ @@ -98,6 +99,7 @@ static inline void cleanup_rng(struct kref *kref) if (rng-cleanup) rng-cleanup(rng); + wake_up_all(rng_done); } static void set_current_rng(struct hwrng *rng) @@ -529,6 +531,9 @@ void hwrng_unregister(struct hwrng *rng) } mutex_unlock(rng_mutex); + + /* Just in case rng is reading right now, wait. */ + wait_event(rng_done, atomic_read(rng-ref.refcount) == 0); While it's obviously better than what we have now, I don't believe this is 100% safe as the cleanup function might still be running even after the ref count hits zero. Once we return from this function the module may be unloaded so we need to ensure that nothing is running at this point. Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 4/6] hw_random: fix unregister race.
On Thu, Sep 18, 2014 at 08:37:45PM +0800, Amos Kong wrote: From: Rusty Russell ru...@rustcorp.com.au The previous patch added one potential problem: we can still be reading from a hwrng when it's unregistered. Add a wait for zero in the hwrng_unregister path. Signed-off-by: Rusty Russell ru...@rustcorp.com.au You totally corrupted Rusty's patch. If you're going to repost his series you better make sure that you've got the right patches. Just as well though as it made me think a little more about this patch :) Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/3] fix stuck in accessing hwrng attributes
On Tue, Sep 16, 2014 at 12:02:26AM +0800, Amos Kong wrote: If we read hwrng by long-running dd process, it takes too much cpu time and almost hold the mutex lock. When we check hwrng attributes from sysfs by cat, it gets stuck in waiting the lock releaseing. The problem can only be reproduced with non-smp guest with slow backend. This patchset resolves the issue by changing rng_dev_read() to always schedule 10 jiffies after release mutex lock, then cat process can have chance to get the lock and execute protected code without stuck. Sorry I'm not going to accept your fix which simply papers over the problem. Please bite the bullet and convert this over to RCU. Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tun: use netif_receive_skb instead of netif_rx_ni
On Wed, Feb 12, 2014 at 09:50:03AM +0800, Qin Chuanyu wrote: On 2014/2/11 23:24, Eric Dumazet wrote: diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 44c4db8..90b4e58 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1184,7 +1184,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, skb_probe_transport_header(skb, 0); rxhash = skb_get_hash(skb); - netif_rx_ni(skb); + rcu_read_lock_bh(); + netif_receive_skb(skb); + rcu_read_unlock_bh(); tun-dev-stats.rx_packets++; tun-dev-stats.rx_bytes += len; I already said this patch is not good : rcu_read_lock_bh() makes no sense here. What is really needed is local_bh_disable(); Herbert patch ( http://patchwork.ozlabs.org/patch/52963/ ) had a much cleaner form. Just use it, CC him, credit him, please ? To: Herbert Xu I saw that you had delivered a patch to resolve the problem of cgroup. patch num is f845172531fb7410c7fb7780b1a6e51ee6df7d52 so would you deliver your patch for tun again? I had test it and found it work well. Feel free to resubmit my patch either as is or with your modifications as I no longer need it for my original purposes. Thanks, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: virtio-rng only returns zeros with CONFIG_HW_RANDOM=m
On Wed, Feb 27, 2013 at 11:56:55AM +1030, Rusty Russell wrote: OK, I looked at doing a kmalloc and copy in virtio_rng, but it's very inelegant (we don't know what size of buffer to allocate). As the hwrng API allows you to return any number of bytes, you can just go back to the old virtio-rng code that did the 64-byte buffer and return a maximum of 64 bytes each time. Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] skbuff: clear tx zero-copy flag
On Mon, Jul 25, 2011 at 11:07:43AM +0300, Michael S. Tsirkin wrote: However macvtap passes an skb directly to the lower device, so as long as macvtap is the only user of that interface, we are fine I think - there's no way for an skb to get from macvtap to splice read path I think. Right? Yes, as long as you can guarantee that the skb never loops back then you should be fine. However, does macvtap really bypass everything, including the qdisc layer? The qdisc layer is certainly capable of looping the skb back with the redirect action. Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] skbuff: clear tx zero-copy flag
On Mon, Jul 25, 2011 at 12:44:14PM +0300, Michael S. Tsirkin wrote: if yes that seems to always clone an skb, which in turn does the copy so we are fine? Yes you're right, it should be safe. However, I think we should add a WARN_ON to the splice skb path so that should a packet find its way through a path that we haven't thought of then at least we'll know about it. Thanks, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] skbuff: clear tx zero-copy flag
Shirley Ma mashi...@us.ibm.com wrote: This patch clears tx zero-copy flag as needed. Sign-off-by: Shirley Ma x...@us.ibm.com I think we also need to copy and clear this flag on the splice read path as that takes a direct page reference. I hope there isn't any other path that does this. Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop
Shirley Ma mashi...@us.ibm.com wrote: + /* Drop packet instead of stop queue for better performance */ I would like to see some justification as to why this is the right way to go and not just papering over the real problem. Thanks, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netfilter: add CHECKSUM target
On Sun, Jul 11, 2010 at 12:21:17PM +0300, Michael S. Tsirkin wrote: I'd think that this target would be protocol-agnostic, no? Meaning it should go into net/netfilter/? Will do. Right, and you should get rid of any IPv4 dependencies. Thanks, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netfilter: add CHECKSUM target
On Fri, Jul 09, 2010 at 01:29:13AM +0300, Michael S. Tsirkin wrote: This adds a `CHECKSUM' target, which can be used in the iptables mangle table. You can use this target to compute and fill in the checksum in an IP packet that lacks a checksum. This is particularly useful, if you need to work around old applications such as dhcp clients, that do not work well with checksum offloads, but don't want to disable checksum offload in your device. The problem happens in the field with virtualized applications. For reference, see Red Hat bz 60, as well as http://www.spinics.net/lists/kvm/msg37660.html Typical expected use (helps old dhclient binary running in a VM): iptables -A POSTROUTING -t mangle -p udp --dport 68 -j CHECKSUM --checksum-fill Signed-off-by: Michael S. Tsirkin m...@redhat.com I'd think that this target would be protocol-agnostic, no? Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Mon, Jun 28, 2010 at 05:56:07PM +0800, Xin, Xiaohui wrote: OK, if I understand you correctly then I don't think have a problem. With your current patch-set you have exactly the same situation when the skb-data is reallocated as a kernel buffer. When will skb-data to be reallocated? May you point me the code path? Anything that calls pskb_expand_head. This is OK because as you correctly argue, it is a rare situation. With my proposal you will need to get this extra external buffer in even less cases, because you'd only need to do it if the skb head grows, which only happens if it becomes encapsulated. So let me explain it in a bit more detail: Our packet starts out as a purely non-linear skb, i.e., skb-head contains nothing and all the page frags come from the guest. During host processing we may pull data into skb-head but the first frag will remain unless we pull all of it. If we did do that then you would have a free external buffer anyway. Now in the common case the header may be modified or pulled, but it very rarely grows. So you can just copy the header back into the first frag just before we give it to the guest. Since the data is still there, so recompute the page offset and size is ok, right? Right, you just move the page offset back and increase the size. However, to do this safely we'd need to have a way of knowing whether the skb head has been modified. It may well turn out to be just as effective to do something like if (memcmp(skb-data, page frag head, skb_headlen)) memcpy(page frag head, skb-data, skb_headlen) As the page frag head should be in cache since it would've been used to populate skb-data. It'd be good to run some benchmarks with this to see whether adding a bit to sk_buff to avoid the memcmp is worth it or not. Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote: In current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete queue pairs) uses the buffer from guest, so it eliminates copy completely in software and requires hardware to do so. If we can have an additonal place to store the buffer per skb (may cause copy later on), we can do copy later on or re-post the buffer to assigned NIC driver later on. But that may be not very clean either :( OK, if I understand you correctly then I don't think have a problem. With your current patch-set you have exactly the same situation when the skb-data is reallocated as a kernel buffer. This is OK because as you correctly argue, it is a rare situation. With my proposal you will need to get this extra external buffer in even less cases, because you'd only need to do it if the skb head grows, which only happens if it becomes encapsulated. So let me explain it in a bit more detail: Our packet starts out as a purely non-linear skb, i.e., skb-head contains nothing and all the page frags come from the guest. During host processing we may pull data into skb-head but the first frag will remain unless we pull all of it. If we did do that then you would have a free external buffer anyway. Now in the common case the header may be modified or pulled, but it very rarely grows. So you can just copy the header back into the first frag just before we give it to the guest. Only in the case where the packet header grows (e.g., encapsulation) would you need to get an extra external buffer. Cheers, -- Email: Herbert Xu herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote: I mean once the frontend side driver post the buffers to the backend driver, the backend driver will immediately use that buffers to compose skb or gro_frags and post them to the assigned host NIC driver as receive buffers. In that case, if the backend driver recieves a packet from the NIC that requires to do copy, it may be unable to find additional free guest buffer because all of them are already used by the NIC driver. We have to reserve some guest buffers for the possible copy even if the buffer address is not identified by original skb :( OK I see what you mean. Can you tell me how does Xiaohui's previous patch-set deal with this problem? Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Wed, Jun 23, 2010 at 04:09:40PM +0800, Dong, Eddie wrote: Xiaohui Herbert: Mixing copy of head 0-copy of bulk data imposes additional challange to find the guest buffer. The backend driver may be unable to find a spare guest buffer from virtqueue at that time which may block the receiving process then. Can't we completely eliminate netdev_alloc_skb here? Assigning guest buffer at this time makes life much easier. I'm not sure I understand you concern. If you mean that when the guest doesn't give enough pages to the host and the host can't receive on behalf of the guest then isn't that already the case with the original patch-set? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Sun, Jun 20, 2010 at 01:06:32PM +0300, Michael S. Tsirkin wrote: Changing the guest virtio to match the backend is a problem, this breaks migration etc. As long as it's done in a backwards compatible way it should be fine. It's just like migrating from a backend that supports TSO to one that doesn't. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Sun, Jun 20, 2010 at 01:39:09PM +0300, Michael S. Tsirkin wrote: It's just like migrating from a backend that supports TSO to one that doesn't. Exactly. We don't support such migration. Well that's something that has to be addressed in the virtio_net. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Sun, Jun 20, 2010 at 02:11:24PM +0300, Michael S. Tsirkin wrote: Rather than modifying all guests, it seems much easier not to assume specific buffer layout in host. Copying network header around seems a small cost. Well sure we can debate the specifics of this implementation detail. However, the fact that virtio_net doesn't support feature renegotiation on live migration is not a valid reason against this. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote: Let's do this then. So far the virtio spec avoided making layout assumptions, leaving guests lay out data as they see fit. Isn't it possible to keep supporting this with zero copy for hardware that can issue DMA at arbitrary addresses? I think you're mistaken with respect to what is being proposed. Raising 512 bytes isn't a hard constraint, it is merely an optimisation for Intel NICs because their PS mode can produce a head fragment of up to 512 bytes. If the guest didn't allocate 512 bytes it wouldn't be the end of the world, it'd just mean that we'd either copy whatever is in the head fragment, or we waste 4096-X bytes of memory where X is the number of bytes in the head. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote: Herbert, I have questions about the idea above: 1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(), then we don't need napi_gro_frags() any more, the driver's original receiving function is ok. Right? Well I was actually thinking about converting all drivers that need this to napi_gro_frags. But now that you mention it, yes we could still keep the old interface to minimise the work. 2) Is napi_gro_frags() only suitable for TCP protocol packet? I have done a small test for ixgbe driver to let it only allocate paged buffers and found kernel hangs when napi_gro_frags() receives an ARP packet. It should work with any packet. In fact, I'm pretty sure the other drivers (e.g., cxgb3) use that interface for all packets. 3) As I have mentioned above, with this idea, netdev_alloc_skb() will allocate as usual, the data pointed by skb-data will be copied into the first guest buffer. That means we should reserve sufficient room in guest buffer. For PS mode supported driver (for example ixgbe), the room will be more than 128. After 128bytes, we will put the first frag data. Look into virtio-net.c the function page_to_skb() and receive_mergeable(), that means we should modify guest virtio-net driver to compute the offset as the parameter for skb_set_frag(). How do you think about this? Attached is a patch to how to modify the guest driver. I reserve 512 bytes as an example, and transfer the header len of the skb in hdr-hdr_len. Expanding the buffer size to 512 bytes to accomodate PS mode looks reasonable to me. However, I don't think we should increase the copy threshold to 512 bytes at the same time. I don't have any figures myself but I think if we are to make such a change it should be a separate one and come with supporting numbers. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Fri, Jun 18, 2010 at 03:14:18PM +0800, Xin, Xiaohui wrote: Thanks for the verification. By the way, does that mean that nearly all drivers can use the same napi_gro_frags() to receive buffers though currently each driver has it's own receiving function? There is no reason why the napi_gro_frags can't be used by any driver that supports receiving data into pages. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Sat, Jun 12, 2010 at 05:31:10PM +0800, Xin, Xiaohui wrote: 1) Modify driver from netdev_alloc_skb() to alloc user pages if dev is zero-copyed. If the driver support PS mode, then modify alloc_page() too. Well if you were doing this then the driver won't be generating skbs at all. So you don't need to change netdev_alloc_skb. The driver would currently do alloc_page, so you would replace that with netdev_alloc_page, which can call your new function to allocate an external page where appropriate. IOW you just need one change in the driver if it already uses the skbless interface, to replace with alloc_page. If the driver doesn't use the skbless interface then you need to make a few more changes but it isn't too hard either, it'll also mean that the driver will have less overhead even for normal use which is a win-win situation. 2) Add napi_gro_frags() in driver to receive the user pages instead of driver's receiving function. 3) napi_gro_frags() will allocate small skb and pull the header data from the first page to skb-data. Is above the way what you have suggested? Yes. I have thought something in detail about the way. 1) The first page will have an offset after the header is copied into allocated kernel skb. The offset should be recalculated when the user page data is transferred to guest. This may modify some of the gro code. We could keep track whether the stack has modified the header, since you can simply ignore it if it doesn't modify it, which should be the common case for virt. 2) napi_gro_frags() may remove a page when it's data is totally be pulled, but we cannot put a user page as normally. This may modify the gro code too. If it does anything like that, then we're not in the fast-path case so you can just fall back to copying. 3) When the user buffer returned to guest, some of them need to be appended a vnet header. That means for some pages, the vnet header room should be reserved when allocated. But we cannot know which one will be used as the first page when allocated. If we reserved vnet header for each page, since the set_skb_frag() in guest driver only use the offset 0 for second pages, then page data will be wrong. I don't see why this would be a problem, since as far as what the driver is putting onto the physical RX ring nothing has changed. IOW if you want to designate a certain page as special, or the first page, you can still do so. So can you explain which bits of your patches would be affected by this? 4) Since the user buffer pages should be released, so we still need a dtor callback to do that, and then I still need a place to hold it. How do you think about to put it in skb_shinfo? While I don't like that very much I guess I can live with that if nobody else objects. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Sun, Jun 13, 2010 at 04:58:36PM +0800, Xin, Xiaohui wrote: Herbert, In this way, I think we should create 3 functions at least in drivers to allocate rx buffer, to receive the rx buffers, and to clean the rx buffers. We can also have another way here. We can provide a function to only substitute alloc_page(), and a function to release the pages when cleaning the rx buffers. Yes that's exactly what I had in mind. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote: Still not sure this is a good idea for a couple of reasons: 1. We already have lots of special cases with skb's (frags and fraglist), and skb's travel through a lot of different parts of the kernel. So any new change like this creates lots of exposed points for new bugs. Look at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec and ppp and ... 2. SKB's can have infinite lifetime in the kernel. If these buffers come from a fixed size pool in an external device, they can easily all get tied up if you have a slow listener. What happens then? I agree with Stephen on this. FWIW I don't think we even need the external pages concept in order to implement zero-copy receive (which I gather is the intent here). Here is one way to do it, simply construct a completely non-linear packet in the driver, as you would if you were using the GRO frags interface (grep for napi_gro_frags under drivers/net for examples). This way you can transfer the entire contents of the packet without copying through to the other side, provided that the host stack does not modify the packet. If the host side did modify the packet then we have to incur the memory cost anyway. IOW I think the only feature provided by the external pages construct is allowing the skb-head area to be shared without copying. I'm claiming that this can be done by simply making skb-head empty. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: network shutdown under heavy load
On Sun, Jan 10, 2010 at 02:30:12PM +0200, Avi Kivity wrote: This isn't in 2.6.27.y. Herbert, can you send it there? It appears that now that TX is fixed we have a similar problem with RX. Once I figure that one out I'll send them together. Who is maintaining that BTW, sta...@kernel.org? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: network shutdown under heavy load
On Thu, Dec 17, 2009 at 01:15:46PM -0500, rek2 wrote: I been told that today the network when down again and one of the guys here had to log using the console and restart it for that particular guests.. on the guest: uname -a Linux 2.6.27.25-170.2.72.fc10.x86_64 #1 SMP Sun Jun 21 18:39:34 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux Next time it goes down I will try to run a sniffer and try both sides. OK I'm fairly sure this version has a buggy virtio-net. Does this patch (if it applies :) help? diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 9eec5a5..74b3854 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -521,8 +521,10 @@ static void xmit_tasklet(unsigned long data) vi-svq-vq_ops-kick(vi-svq); vi-last_xmit_skb = NULL; } - if (vi-free_in_tasklet) + if (vi-free_in_tasklet) { free_old_xmit_skbs(vi); + netif_wake_queue(vi-dev); + } netif_tx_unlock_bh(vi-dev); } Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: network shutdown under heavy load
On Wed, Dec 16, 2009 at 02:17:04PM +0200, Avi Kivity wrote: On 12/14/2009 05:49 PM, rek2 wrote: Hello, we notice that when we stress any of our guests, in this case they are all fedora, the KVM network will shutdown.. anyone experience this? Herbert? What's the exact guest kernel version? When the network is down, please get onto the guest console to determine which direction (if not both) of the network is not functioning. You can run tcpdump in the guest/host and execute pings on both sides to see which direction is blocked. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources
On Tue, Jun 09, 2009 at 10:51:23PM -0700, Sridhar Samudrala wrote: Unfortunately, this doesn't work for UDP without any changes. skb_segment() currently adds transport header to each segmented skb. But when we are using IP fragmentation, only the first fragment should include the UDP header. We either need to fix skb_segment() to handle IP fragmentation or write a new skb_fragment(). I will look into this when i get some time. Couldn't you get around this by setting skb-transport_header to skb-network_header before getting into skb_segment? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources
On Fri, Jun 05, 2009 at 05:16:31PM -0700, Sridhar Samudrala wrote: + /* Software UFO is not yet supported */ + segs = ERR_PTR(-EPROTONOSUPPORT); Hmm, we need to fill this in before you start using it for virt. After all, it's very difficult for the guest to know whether the output device on the host is going to be able to do UFO or not. The whole reason I did GSO in the first place is to allow the guest not to worry about whether the host supported TSO in hardware. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Fri, Apr 03, 2009 at 01:18:54PM +0200, Andi Kleen wrote: Check shared ring status when stuffing a request. If there are requests That means you're bouncing cache lines all the time. Probably not a big issue on single socket but could be on larger systems. If the backend is running on a core that doesn't share caches with the guest queue then you've got bigger problems. Right this is unavoidable for guests with many CPUs but that should go away once we support multiqueue in virtio-net. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Fri, Apr 03, 2009 at 02:46:04PM +0300, Avi Kivity wrote: The host writes the packet to tap, at which point it is consumed from its point of view. The host would like to mention that if there was an API to notify it when the packet was actually consumed, then it would gladly use it. Bonus points if this involves not copying the packet. We're using write(2) for this, no? That should invoke netif_rx_ni which blocks until the packet is processed, which usually means that it's placed on the NIC's hardware queue. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Avi Kivity a...@redhat.com wrote: virtio is already non-kvm-specific (lguest uses it) and non-pci-specific (s390 uses it). I think Greg's work shows that putting the backend in the kernel can dramatically reduce the cost of a single guest-host transaction. I'm sure the same thing would work for virtio too. If you have a good exit mitigation scheme you can cut exits by a factor of 100; so the userspace exit costs are cut by the same factor. If you have good copyless networking APIs you can cut the cost of copies to zero (well, to the cost of get_user_pages_fast(), but a kernel solution needs that too). Given the choice of having to mitigate or not having the problem in the first place, guess what I would prefer :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Thu, Apr 02, 2009 at 09:46:49AM +0300, Avi Kivity wrote: I don't understand this. If we had good interfaces, all that userspace would do is translate guest physical addresses to host physical addresses, and translate the guest-host protocol to host API calls. I don't see anything there that benefits from being in the kernel. Can you elaborate? I think Greg has expressed it clearly enough. At the end of the day, the numbers speak for themselves. So if and when there's a user-space version that achieves the same or better results, then I will change my mind :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Thu, Apr 02, 2009 at 12:02:09PM +0300, Avi Kivity wrote: There is no choice. Exiting from the guest to the kernel to userspace is prohibitively expensive, you can't do that on every packet. I was referring to the bit between the kernel and userspace. In any case, I just looked at the virtio mitigation code again and I am completely baffled at why we need it. Look at Greg's code or the netback/netfront notification, why do we need this completely artificial mitigation when the ring itself provides a natural way of stemming the flow? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote: If tap told us when the packets were actually transmitted, life would be wonderful: And why do we need this? Because we are in user space! I'll continue to wait for your patch and numbers :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Thu, Apr 02, 2009 at 05:29:36PM +0800, Herbert Xu wrote: On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote: If tap told us when the packets were actually transmitted, life would be wonderful: And why do we need this? Because we are in user space! I'll continue to wait for your patch and numbers :) And in case you're working on that patch, this might interest you. Check out the netdev thread titled TX time stamping. Now that we assign the tap skb with its own sk, these two scenarios are pretty much identical. I also noitced despite davem's threats to revert the patch, it has now made Linus's tree :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Thu, Apr 02, 2009 at 12:38:46PM +0300, Avi Kivity wrote: Why does a kernel solution not need to know when a packet is transmitted? Because you can install your own destructor? I don't know what Greg did, but netback did that nasty page destructor hack which Jeremy is trying to undo :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Thu, Apr 02, 2009 at 12:43:54PM +0300, Avi Kivity wrote: So we're back to the problem is with the kernel-user interface, not userspace being cursed into slowness. Well until you have a patch + numbers that's only an allegation :) -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Thu, Apr 02, 2009 at 12:03:32PM +0300, Avi Kivity wrote: Like Anthony said, the problem is with the kernel-user interfaces. We won't have a good user space virtio implementation until that is fixed. If it's just the interface that's bad, then it should be possible to do a proof-of-concept patch to show that this is the case. Even if we have to redesign the interface, at least you can then say that you guys were right all along :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Thu, Apr 02, 2009 at 07:54:21PM +0300, Avi Kivity wrote: 3ms latency for ping? (ping will always be scheduled immediately when the reply arrives if I understand cfs, so guest load won't delay it) That only happens if the guest immediately does some CPU-intensive computation 3ms and assuming its timeslice lasts that long. In any case, the same thing will happen right now if the host or some other guest on the same CPU hogs the CPU for 3ms. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
On Fri, Apr 03, 2009 at 01:06:10AM +0800, Herbert Xu wrote: That only happens if the guest immediately does some CPU-intensive computation 3ms and assuming its timeslice lasts that long. In any case, the same thing will happen right now if the host or some other guest on the same CPU hogs the CPU for 3ms. Even better, look at the packet's TOS. If it's marked for low- latency then vmexit immediately. Otherwise continue. In the backend you'd just set the marker in shared memory. Of course invert this for the host = guest direction. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Anthony Liguori anth...@codemonkey.ws wrote: Anyway, if we're able to send this many packets, I suspect we'll be able to also handle much higher throughputs without TX mitigation so that's what I'm going to look at now. Awesome! I'm prepared to eat my words :) On the subject of TX mitigation, can we please set a standard on how we measure it? For instance, do we bind the the backend qemu to the same CPU as the guest, or do we bind it to a different CPU that shares cache? They're two completely different scenarios and I think we should be explicit about which one we're measuring. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Rusty Russell ru...@rustcorp.com.au wrote: As you point out, 350-450 is possible, which is still bad, and it's at least partially caused by the exit to userspace and two system calls. If virtio_net had a backend in the kernel, we'd be able to compare numbers properly. FWIW I don't really care whether we go with this or a kernel virtio_net backend. Either way should be good. However the status quo where we're stuck with a user-space backend really sucks! Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Anthony Liguori anth...@codemonkey.ws wrote: That said, I don't think we're bound today by the fact that we're in userspace. Rather we're bound by the interfaces we have between the host kernel and userspace to generate IO. I'd rather fix those interfaces than put more stuff in the kernel. I'm sorry but I totally disagree with that. By having our IO infrastructure in user-space we've basically given up the main advantage of kvm, which is that the physical drivers operate in the same environment as the hypervisor. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/17] virtual-bus
Chris Wright chr...@sous-sol.org wrote: That said, I don't think we're bound today by the fact that we're in userspace. Rather we're bound by the interfaces we have between the host kernel and userspace to generate IO. I'd rather fix those interfaces than put more stuff in the kernel. And more stuff in the kernel can come at the potential cost of weakening protection/isolation. Protection/isolation always comes at a cost. Not everyone wants to pay that, just like health insurance :) We should enable the users to choose which model they want, based on their needs. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: copyless virtio net thoughts?
On Wed, Feb 18, 2009 at 10:08:00PM +1030, Rusty Russell wrote: 4) Multiple queues This is Herbert's. Should be fairly simple to add; it was in the back of my mind when we started. Not sure whether the queues should be static or dynamic (imagine direct interguest networking, one queue pair for each other guest), and how xmit queues would be selected by the guest (anything anywhere, or dst mac?). The primary purpose of multiple queues is to maximise CPU utilisation, so the number of queues is simply dependent on the number of CPUs allotted to the guest. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: copyless virtio net thoughts?
On Thu, Feb 05, 2009 at 02:37:07PM +0200, Avi Kivity wrote: I believe that copyless networking is absolutely essential. I used to think it was important, but I'm now of the opinion that it's quite useless for virtualisation as it stands. For transmit, copyless is needed to properly support sendfile() type workloads - http/ftp/nfs serving. These are usually high-bandwidth, cache-cold workloads where a copy is most expensive. This is totally true for baremetal, but useless for virtualisation right now because the block layer is not zero-copy. That is, the data is going to be cache hot anyway so zero-copy networking doesn't buy you much at all. Please also recall that for the time being, block speeds are way slower than network speeds. So the really interesting case is actually network-to-network transfers. Again due to the RX copy this is going to be cache hot. For receive, the guest will almost always do an additional copy, but it will most likely do the copy from another cpu. Xen netchannel2 That's what we should strive to avoid. The best scenario with modern 10GbE NICs is to stay on one CPU if at all possible. The NIC will pick a CPU when it delivers the packet into one of the RX queues and we should stick with it for as long as possible. So what I'd like to see next in virtualised networking is virtual multiqueue support in guest drivers. No I'm not talking about making one or more of the physical RX/TX queues available to the guest (aka passthrough), but actually turning something like the virtio-net interface into a multiqueue interface. This is the best way to get cache locality and minimise CPU waste. So I'm certainly not rushing out to do any zero-copy virtual networking. However, I would like to start working on a virtual multiqueue NIC interface. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] virtio_net: Add a MAC filter table
On Thu, Jan 29, 2009 at 10:25:46AM +1030, Rusty Russell wrote: So, this conversation has convinced me that the host should accept arbitrary filtering entries, and the guest should accept that it is best effort. I agree with this completely. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Work around dhclient brokenness
On Fri, Aug 15, 2008 at 02:47:12PM -0500, Anthony Liguori wrote: +static void work_around_broken_dhclient(struct virtio_net_hdr *hdr, +const uint8_t *buf, size_t size) +{ +if ((hdr-flags VIRTIO_NET_HDR_F_NEEDS_CSUM) /* missing csum */ +(size 18 size 1500) /* normal sized MTU */ +(buf[12] == 0x08 buf[13] == 0x00) /* ethertype == IPv4 */ +(buf[23] == 17)) { /* ip.protocol == UDP */ +/* FIXME this cast is evil */ +net_checksum_calculate((uint8_t *)buf, size); If we're going to do this, how about just setting the checksum to zero? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Work around dhclient brokenness
On Tue, Aug 19, 2008 at 07:08:23PM +1000, Rusty Russell wrote: Not really. We could extend the protocol, but that's currently how feature negotiation works: you can't do it while the device is live. That seemed simplest. I learnt from Xen :) (Of course, we don't need to *disable* it, we need to *enable* it). I don't see why we shouldn't support disabling it. After all, any NIC that supports receive checksum offload allows it to be disabled. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Work around dhclient brokenness
On Mon, Aug 18, 2008 at 02:40:55PM +0300, Avi Kivity wrote: Isn't that turned on automatically for real hardware? And what's to prevent a broken dhclient together with the (presumably) hacked up initscripts that call ethtool? Well the idea is that only a fixed guest would even know about enabling this. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Work around dhclient brokenness
On Mon, Aug 18, 2008 at 10:13:55PM -0700, Chris Wedgwood wrote: CSUM2 sounds so ugly though. Features seem to get added and never removed how about if this had a documented short lifetime (if it really must go in)? All we need is a simple toggle to disable checksum offload. Every NIC that offers receive checksum offload allows it to be disabled. virtio shouldn't be any different. Resetting the NIC seems a bit over the top. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/12] virtio_net perf patches
On Tue, Aug 12, 2008 at 07:12:28PM +0100, Mark McLoughlin wrote: Are you basically just saying that guests with a broken dhclient should manually disable rx checksum offload with ethtool? And that the host should react by disabling tx offload on the tap interfacE? No I'm saying that everybody should default to no checksum offload. Those that have working kernels + user-space can then enable it on boot-up. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] Fix partial csum rx handling
On Tue, May 27, 2008 at 12:36:06PM +0100, Mark McLoughlin wrote: When we receive a packet with a partial csum, just set ip_summed to CHECKSUM_UNNECESSARY. Signed-off-by: Mark McLoughlin [EMAIL PROTECTED] Cc: Herbert Xu [EMAIL PROTECTED] I'm not sure I follow the context of this patch. Could you explain what it's for? Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html