Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, Sep 28, 2010 at 08:24:29PM -0700, Shirley Ma wrote: Hello Michael, On Wed, 2010-09-15 at 07:52 -0700, Shirley Ma wrote: Don't you think once I address vhost_add_used_and_signal update issue, it is a simple and complete patch for macvtap TX zero copy? Thanks Shirley I like the fact that the patch is simple. Unfortunately I suspect it'll stop being simple by the time it's complete :) I can make a try. :) I compared several approaches for addressing the issue being raised here on how/when to update vhost_add_used_and_signal. The simple approach I have found is: 1. Adding completion field in struct virtqueue; 2. when it is a zero copy packet, put vhost thread wait for completion to update vhost_add_used_and_signal; 3. passing vq from vhost to macvtap as skb destruct_arg; 4. when skb is freed for the last reference, signal vq completion The test results show same performance as the original patch. How do you think? If it sounds good to you. I will resubmit this reversion patch. The patch still keeps as simple as it was before. :) Thanks Shirley If you look at dev_hard_start_xmit you will see a call to skb_orphan_try which often calls the skb destructor. So I suspect this is almost equivalent to your original patch, and has the same correctness issue. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, Sep 29, 2010 at 10:16:45AM +0200, Michael S. Tsirkin wrote: On Tue, Sep 28, 2010 at 08:24:29PM -0700, Shirley Ma wrote: Hello Michael, On Wed, 2010-09-15 at 07:52 -0700, Shirley Ma wrote: Don't you think once I address vhost_add_used_and_signal update issue, it is a simple and complete patch for macvtap TX zero copy? Thanks Shirley I like the fact that the patch is simple. Unfortunately I suspect it'll stop being simple by the time it's complete :) I can make a try. :) I compared several approaches for addressing the issue being raised here on how/when to update vhost_add_used_and_signal. The simple approach I have found is: 1. Adding completion field in struct virtqueue; 2. when it is a zero copy packet, put vhost thread wait for completion to update vhost_add_used_and_signal; 3. passing vq from vhost to macvtap as skb destruct_arg; 4. when skb is freed for the last reference, signal vq completion The test results show same performance as the original patch. How do you think? If it sounds good to you. I will resubmit this reversion patch. The patch still keeps as simple as it was before. :) Thanks Shirley If you look at dev_hard_start_xmit you will see a call to skb_orphan_try which often calls the skb destructor. So I suspect this is almost equivalent to your original patch, and has the same correctness issue. So you could try doing skb_tx(skb)-prevent_sk_orphan = 1 just to see what will happen. Might be interesting - just make sure the device doesn't orphan the skb first thing. I suspect lack of parallelism will result in bad throughput esp for small messages. Note this still won't make it correct (this has module unloading issue, and devices might still orphan skb, clone it, or hang on to paged data in some other way) but at least closer. I think you should try testing with guest to external communication, this will uncover some of these correctness issues for you. I think netperf also has some flag to check data, might be a good idea to use it for testing. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, 2010-09-29 at 10:16 +0200, Michael S. Tsirkin wrote: If you look at dev_hard_start_xmit you will see a call to skb_orphan_try which often calls the skb destructor. So I suspect this is almost equivalent to your original patch, and has the same correctness issue. I forgot to mention, in skb_orphan_try, I still kept the skb-sk reference to sk, I assign skb-sk to NULL when the DMA done. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, 2010-09-29 at 10:28 +0200, Michael S. Tsirkin wrote: I think you should try testing with guest to external communication, this will uncover some of these correctness issues for you. I think netperf also has some flag to check data, might be a good idea to use it for testing. I always tested guest to remote host, remote host to guest. What other external communication do you suggest here? Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, 2010-09-29 at 10:16 +0200, Michael S. Tsirkin wrote: I compared several approaches for addressing the issue being raised here on how/when to update vhost_add_used_and_signal. The simple approach I have found is: 1. Adding completion field in struct virtqueue; 2. when it is a zero copy packet, put vhost thread wait for completion to update vhost_add_used_and_signal; 3. passing vq from vhost to macvtap as skb destruct_arg; 4. when skb is freed for the last reference, signal vq completion The test results show same performance as the original patch. How do you think? If it sounds good to you. I will resubmit this reversion patch. The patch still keeps as simple as it was before. :) Thanks Shirley If you look at dev_hard_start_xmit you will see a call to skb_orphan_try which often calls the skb destructor. So I suspect this is almost equivalent to your original patch, and has the same correctness issue. If I didn't address this, then vhost net will wait for ever with wait for completion :)) Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, 2010-09-29 at 10:28 +0200, Michael S. Tsirkin wrote: 1. Adding completion field in struct virtqueue; 2. when it is a zero copy packet, put vhost thread wait for completion to update vhost_add_used_and_signal; 3. passing vq from vhost to macvtap as skb destruct_arg; 4. when skb is freed for the last reference, signal vq completion The test results show same performance as the original patch. How do you think? If it sounds good to you. I will resubmit this reversion patch. The patch still keeps as simple as it was before. :) Thanks Shirley If you look at dev_hard_start_xmit you will see a call to skb_orphan_try which often calls the skb destructor. So I suspect this is almost equivalent to your original patch, and has the same correctness issue. So you could try doing skb_tx(skb)-prevent_sk_orphan = 1 just to see what will happen. Might be interesting - just make sure the device doesn't orphan the skb first thing. I suspect lack of parallelism will result in bad throughput esp for small messages. Note this still won't make it correct (this has module unloading issue, and devices might still orphan skb, clone it, or hang on to paged data in some other way) but at least closer. For message size smaller than 128, it still does copy. I tested some small message size, I didn't see any regression. I will run more test to focus on small message size between 128 - 4K. I don't need prevent_sk_orphan since in skb_release_data for last reference, I just need the ZEROCOPY flag from that sock to signal a completion. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, Sep 28, 2010 at 08:24:29PM -0700, Shirley Ma wrote: Hello Michael, On Wed, 2010-09-15 at 07:52 -0700, Shirley Ma wrote: Don't you think once I address vhost_add_used_and_signal update issue, it is a simple and complete patch for macvtap TX zero copy? Thanks Shirley I like the fact that the patch is simple. Unfortunately I suspect it'll stop being simple by the time it's complete :) I can make a try. :) I compared several approaches for addressing the issue being raised here on how/when to update vhost_add_used_and_signal. The simple approach I have found is: 1. Adding completion field in struct virtqueue; 2. when it is a zero copy packet, put vhost thread wait for completion to update vhost_add_used_and_signal; 3. passing vq from vhost to macvtap as skb destruct_arg; 4. when skb is freed for the last reference, signal vq completion The test results show same performance as the original patch. How do you think? If it sounds good to you. I will resubmit this reversion patch. The patch still keeps as simple as it was before. :) Thanks Shirley I guess I just don't understand what your patch does. If you send it, I can take a look. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, 2010-09-29 at 17:14 +0200, Michael S. Tsirkin wrote: I guess I just don't understand what your patch does. If you send it, I can take a look. Ok, I will collect more performance data and send the patch. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
Hello Michael, On Wed, 2010-09-15 at 07:52 -0700, Shirley Ma wrote: Don't you think once I address vhost_add_used_and_signal update issue, it is a simple and complete patch for macvtap TX zero copy? Thanks Shirley I like the fact that the patch is simple. Unfortunately I suspect it'll stop being simple by the time it's complete :) I can make a try. :) I compared several approaches for addressing the issue being raised here on how/when to update vhost_add_used_and_signal. The simple approach I have found is: 1. Adding completion field in struct virtqueue; 2. when it is a zero copy packet, put vhost thread wait for completion to update vhost_add_used_and_signal; 3. passing vq from vhost to macvtap as skb destruct_arg; 4. when skb is freed for the last reference, signal vq completion The test results show same performance as the original patch. How do you think? If it sounds good to you. I will resubmit this reversion patch. The patch still keeps as simple as it was before. :) Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 15, 2010 5:59 PM To: Xin, Xiaohui Cc: Shirley Ma; Arnd Bergmann; Avi Kivity; David Miller; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Wed, Sep 15, 2010 at 10:46:02AM +0800, Xin, Xiaohui wrote: From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 15, 2010 12:30 AM To: Shirley Ma Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote: I would expect this to hurt performance significantly. We could do this for asynchronous requests only to avoid the slowdown. Is kiocb in sendmsg helpful here? It is not used now. Shirley Precisely. This is what the patch from Xin Xiaohui does. That code already seems to do most of what you are trying to do, right? The main thing missing seems to be macvtap integration, so that we can fall back on data copy if zero copy is unavailable? How hard would it be to basically link the mp and macvtap modules together to get us this functionality? Anyone? Michael, Is to support macvtap with zero-copy through mp device the functionality you mentioned above? I have trouble parsing the above question. At some point Arnd suggested that the mp device functionality would fit nicely as part of the macvtap driver. It seems to make sense superficially, the advantage if it worked would be that we would get zero copy (mostly) transparently. Do you agree with this goal? I would say yes. Before Shirley Ma has suggested to move the zero-copy functionality into tun/tap device or macvtap device. How do you think about that? I suspect there will be a lot of duplicate code in that three drivers except we can extract code of zero-copy into kernel APIs and vhost APIs. tap would be very hard at this point as it does not bind to a device. macvtap might work, we mainly need to figure out a way to detect that device can do zero copy so the right mode is used. I think a first step could be to simply link mp code into macvtap module, pass necessary ioctls on, then move some code around as necessary. This might get rid of code duplication nicely. I'll look into this to see how much effort would be. Do you think that's worth to do and help current process which is blocked too long than I expected? I think it's nice to have. And if done hopefully this will get the folk working on the macvtap driver to review the code, which will help find all issues faster. I think if you post some performance numbers, this will also help get people excited and looking at the code. The performance data I have posted before is compared with raw socket on vhost-net. But currently, the raw socket backend is removed from the qemu side. So I can only compare with tap on vhost-net. But unfortunately, I missed something that I even can't bring it up. I was blocked by this for a time. I also don't see the process as completely blocked, each review round points out more issues: we aren't going back and forth changing same lines again and again, are we? One thing that might help is increase the frequency of updates, try sending them out sooner. On the other hand 10 new patches each revision is a lot: if there is a part of patchset that has stabilised you can split it out, post once and keep posting the changing part separately. I hope these suggestions help. Thanks, Michael! -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Thu, Sep 16, 2010 at 04:18:10PM +0800, Xin, Xiaohui wrote: From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 15, 2010 5:59 PM To: Xin, Xiaohui Cc: Shirley Ma; Arnd Bergmann; Avi Kivity; David Miller; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Wed, Sep 15, 2010 at 10:46:02AM +0800, Xin, Xiaohui wrote: From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 15, 2010 12:30 AM To: Shirley Ma Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote: I would expect this to hurt performance significantly. We could do this for asynchronous requests only to avoid the slowdown. Is kiocb in sendmsg helpful here? It is not used now. Shirley Precisely. This is what the patch from Xin Xiaohui does. That code already seems to do most of what you are trying to do, right? The main thing missing seems to be macvtap integration, so that we can fall back on data copy if zero copy is unavailable? How hard would it be to basically link the mp and macvtap modules together to get us this functionality? Anyone? Michael, Is to support macvtap with zero-copy through mp device the functionality you mentioned above? I have trouble parsing the above question. At some point Arnd suggested that the mp device functionality would fit nicely as part of the macvtap driver. It seems to make sense superficially, the advantage if it worked would be that we would get zero copy (mostly) transparently. Do you agree with this goal? I would say yes. In that case, it's a blocker for upstream merge because this change affects userspace. Before Shirley Ma has suggested to move the zero-copy functionality into tun/tap device or macvtap device. How do you think about that? I suspect there will be a lot of duplicate code in that three drivers except we can extract code of zero-copy into kernel APIs and vhost APIs. tap would be very hard at this point as it does not bind to a device. macvtap might work, we mainly need to figure out a way to detect that device can do zero copy so the right mode is used. I think a first step could be to simply link mp code into macvtap module, pass necessary ioctls on, then move some code around as necessary. This might get rid of code duplication nicely. I'll look into this to see how much effort would be. Do you think that's worth to do and help current process which is blocked too long than I expected? I think it's nice to have. And if done hopefully this will get the folk working on the macvtap driver to review the code, which will help find all issues faster. I think if you post some performance numbers, this will also help get people excited and looking at the code. The performance data I have posted before is compared with raw socket on vhost-net. But currently, the raw socket backend is removed from the qemu side. So I can only compare with tap on vhost-net. But unfortunately, I missed something that I even can't bring it up. I was blocked by this for a time. Hey, maybe you are seeing the bug that was reported recently. Could you try tcpdump -i on the tap interface in host and ethX on guest and tell me what you see? If you see packet in guest but not in host, could you try adding printks in vhost handle_tx to see whether it gets called and if yes where it fails? I also don't see the process as completely blocked, each review round points out more issues: we aren't going back and forth changing same lines again and again, are we? One thing that might help is increase the frequency of updates, try sending them out sooner. On the other hand 10 new patches each revision is a lot: if there is a part of patchset that has stabilised you can split it out, post once and keep posting the changing part separately. I hope these suggestions help. Thanks, Michael! -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, 2010-09-15 at 07:27 +0200, Michael S. Tsirkin wrote: For some of the issues, try following the discussion around net: af_packet: don't call tpacket_destruct_skb() until the skb is sent out. Summary: it's difficult to do correctly generally. Limiting ourselves to transmit on specific devices might make it possible. Thanks for the tips. Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, 2010-09-15 at 07:12 +0200, Michael S. Tsirkin wrote: Yes, I agree this patch is useful for demo purposes: simple, and shows what kind of performance gains we can expect for TX. Any other issue you can see in this patch beside vhost descriptors update? Don't you think once I address vhost_add_used_and_signal update issue, it is a simple and complete patch for macvtap TX zero copy? Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, Sep 15, 2010 at 10:46:02AM +0800, Xin, Xiaohui wrote: From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 15, 2010 12:30 AM To: Shirley Ma Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote: I would expect this to hurt performance significantly. We could do this for asynchronous requests only to avoid the slowdown. Is kiocb in sendmsg helpful here? It is not used now. Shirley Precisely. This is what the patch from Xin Xiaohui does. That code already seems to do most of what you are trying to do, right? The main thing missing seems to be macvtap integration, so that we can fall back on data copy if zero copy is unavailable? How hard would it be to basically link the mp and macvtap modules together to get us this functionality? Anyone? Michael, Is to support macvtap with zero-copy through mp device the functionality you mentioned above? I have trouble parsing the above question. At some point Arnd suggested that the mp device functionality would fit nicely as part of the macvtap driver. It seems to make sense superficially, the advantage if it worked would be that we would get zero copy (mostly) transparently. Do you agree with this goal? Before Shirley Ma has suggested to move the zero-copy functionality into tun/tap device or macvtap device. How do you think about that? I suspect there will be a lot of duplicate code in that three drivers except we can extract code of zero-copy into kernel APIs and vhost APIs. tap would be very hard at this point as it does not bind to a device. macvtap might work, we mainly need to figure out a way to detect that device can do zero copy so the right mode is used. I think a first step could be to simply link mp code into macvtap module, pass necessary ioctls on, then move some code around as necessary. This might get rid of code duplication nicely. Do you think that's worth to do and help current process which is blocked too long than I expected? I think it's nice to have. And if done hopefully this will get the folk working on the macvtap driver to review the code, which will help find all issues faster. I think if you post some performance numbers, this will also help get people excited and looking at the code. I also don't see the process as completely blocked, each review round points out more issues: we aren't going back and forth changing same lines again and again, are we? One thing that might help is increase the frequency of updates, try sending them out sooner. On the other hand 10 new patches each revision is a lot: if there is a part of patchset that has stabilised you can split it out, post once and keep posting the changing part separately. I hope these suggestions help. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, Sep 14, 2010 at 11:21:15PM -0700, Shirley Ma wrote: On Wed, 2010-09-15 at 07:12 +0200, Michael S. Tsirkin wrote: Yes, I agree this patch is useful for demo purposes: simple, and shows what kind of performance gains we can expect for TX. Any other issue you can see in this patch beside vhost descriptors update? Another issue is that macvtap can be bound to almost anything, including e.g. a tap device or a bridge, which might hang on to skb fragments for unlimited time. Zero copy TX won't easily work there. I can imagine either somehow triggering a data copy after the fact (hard), or detecting such devices and avoiding zero copy (unfortunate for guest to guest, and drivers will need tuning). Don't you think once I address vhost_add_used_and_signal update issue, it is a simple and complete patch for macvtap TX zero copy? Thanks Shirley I like the fact that the patch is simple. Unfortunately I suspect it'll stop being simple by the time it's complete :) -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, 2010-09-15 at 12:10 +0200, Michael S. Tsirkin wrote: Another issue is that macvtap can be bound to almost anything, including e.g. a tap device or a bridge, which might hang on to skb fragments for unlimited time. Zero copy TX won't easily work there. I can imagine either somehow triggering a data copy after the fact (hard), or detecting such devices and avoiding zero copy (unfortunate for guest to guest, and drivers will need tuning). So far macvtap zero copy patch is limited to lower devices supports high memory DMA, it doesn't apply to a tap device or a bridge. Don't you think once I address vhost_add_used_and_signal update issue, it is a simple and complete patch for macvtap TX zero copy? Thanks Shirley I like the fact that the patch is simple. Unfortunately I suspect it'll stop being simple by the time it's complete :) I can make a try. :) Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, 2010-09-15 at 17:39 +0200, Michael S. Tsirkin wrote: In fact, I rechecked: both bridge and loopback have NETIF_F_HIGHDMA set. So maybe we should check NETIF_F_NETNS_LOCAL ... macvtap in bridged mode is interesting as well. I found that too, just wondered which flag to use is better. :) Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html From maverick-changes-boun...@lists.ubuntu.com Wed Sep 15 10:00:38 2010 Return-path: maverick-changes-boun...@lists.ubuntu.com Envelope-to: arch...@mail-archive.com Delivery-date: Wed, 15 Sep 2010 10:00:38 -0700 Received: from exprod5mx266.postini.com ([64.18.0.89] helo=psmtp.com) by mail-archive.com with smtp (Exim 4.69) (envelope-from maverick-changes-boun...@lists.ubuntu.com) id 1OvvLa-0005Ig-QC for arch...@mail-archive.com; Wed, 15 Sep 2010 10:00:38 -0700 Received: from source ([91.189.94.204]) by exprod5mx266.postini.com ([64.18.4.10]) with SMTP; Wed, 15 Sep 2010 10:00:37 PDT Received: from localhost ([127.0.0.1] helo=chlorine.canonical.com) by chlorine.canonical.com with esmtp (Exim 4.69) (envelope-from maverick-changes-boun...@lists.ubuntu.com) id 1OvvLT-0006E2-1a; Wed, 15 Sep 2010 18:00:31 +0100 Received: from adelie.canonical.com ([91.189.90.139]) by chlorine.canonical.com with esmtp (Exim 4.69) (envelope-from boun...@canonical.com) id 1OvvLQ-0006DP-Ih for maverick-chan...@lists.ubuntu.com; Wed, 15 Sep 2010 18:00:28 +0100 Received: from cocoplum.canonical.com ([91.189.90.15]) by adelie.canonical.com with esmtp (Exim 4.69 #1 (Debian)) id 1OvvLQ-pq-40 for maverick-chan...@lists.ubuntu.com; Wed, 15 Sep 2010 18:00:28 +0100 Received: from cocoplum.canonical.com (localhost [127.0.0.1]) by cocoplum.canonical.com (Postfix) with ESMTP id 195C816C0A6 for maverick-chan...@lists.ubuntu.com; Wed, 15 Sep 2010 17:00:28 + (UTC) Content-Type: multipart/mixed; boundary7323435133034647172== MIME-Version: 1.0 from: Fabrice Coutadeur fabric...@ubuntu.com subject: [ubuntu/maverick] t38modem 1.2.0-1build1 (Accepted) to: maverick-chan...@lists.ubuntu.com X-Launchpad-Component: component=universe, section=comm X-Katie: Launchpad actually Message-Id: 20100915170028.11124.85014.launch...@cocoplum.canonical.com Date: Wed, 15 Sep 2010 17:00:28 - Precedence: bulk X-Generated-By: Launchpad (canonical.com); Revision=None; Instance=initZopeless config overlay X-Launchpad-Hash: 9370f3bb58447beb8fcb7983a6adb4937501c276 X-BeenThere: maverick-chan...@lists.ubuntu.com X-Mailman-Version: 2.1.9 Reply-To: Fabrice Coutadeur fabric...@ubuntu.com List-Id: Maverick Meerkat archive upload notification list maverick-changes.lists.ubuntu.com List-Unsubscribe: https://lists.ubuntu.com/mailman/listinfo/maverick-changes, mailto:maverick-changes-requ...@lists.ubuntu.com?subject=unsubscribe List-Archive: https://lists.ubuntu.com/archives/maverick-changes List-Post: mailto:maverick-chan...@lists.ubuntu.com List-Help: mailto:maverick-changes-requ...@lists.ubuntu.com?subject=help List-Subscribe: https://lists.ubuntu.com/mailman/listinfo/maverick-changes, mailto:maverick-changes-requ...@lists.ubuntu.com?subject=subscribe Sender: maverick-changes-boun...@lists.ubuntu.com Errors-To: maverick-changes-boun...@lists.ubuntu.com X-pstn-levels: (S:99.9/99.9 CV:99.9000 FC:95.5390 LC:95.5390 R:95.9108 P:95.9108 M:97.0282 C:98.6951 ) X-pstn-settings: 4 (1.5000:1.5000) s cv gt3 gt2 gt1 r p m c X-pstn-addresses: from fabric...@ubuntu.com [294/10] --===7323435133034647172== Content-Type: text/plain; charset=utf-8 MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable t38modem (1.2.0-1build1) maverick; urgency=3Dlow * No changes upload for libopal3.6.6 - libopal3.6.8 transition Date: Wed, 15 Sep 2010 08:18:57 +0200 Changed-By: Fabrice Coutadeur fabric...@ubuntu.com Maintainer: Debian VoIP Team pkg-voip-maintain...@lists.alioth.debian.org Signed-By: Fabrice Coutadeur coutade...@gmail.com https://launchpad.net/ubuntu/maverick/+source/t38modem/1.2.0-1build1 --===7323435133034647172== Content-Type: text/plain; charset=utf-8 MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename=changesfile -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Format: 1.8 Date: Wed, 15 Sep 2010 08:18:57 +0200 Source: t38modem Binary: t38modem Architecture: source Version: 1.2.0-1build1 Distribution: maverick Urgency: low Maintainer: Debian VoIP Team pkg-voip-maintain...@lists.alioth.debian.org Changed-By: Fabrice Coutadeur fabric...@ubuntu.com Description: = t38modem - T.38 Fax over IP pseudo modem Changes: = t38modem (1.2.0-1build1) maverick; urgency=3Dlow . * No changes upload for
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, Sep 15, 2010 at 10:00:04AM -0700, Shirley Ma wrote: On Wed, 2010-09-15 at 17:39 +0200, Michael S. Tsirkin wrote: In fact, I rechecked: both bridge and loopback have NETIF_F_HIGHDMA set. So maybe we should check NETIF_F_NETNS_LOCAL ... macvtap in bridged mode is interesting as well. I found that too, just wondered which flag to use is better. :) Thanks Shirley At some level NETIF_F_NETNS_LOCAL makes sense: local packets can get anywhere. OTOH one wonders whether there might be other issues, e.g. in theory devices could hang on to frag pages just by doing get_page. There might be other issues. Maybe we are better off white-listing known-good drivers with a new flag? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, 2010-09-15 at 19:30 +0200, Michael S. Tsirkin wrote: At some level NETIF_F_NETNS_LOCAL makes sense: local packets can get anywhere. OTOH one wonders whether there might be other issues, e.g. in theory devices could hang on to frag pages just by doing get_page. There might be other issues. Maybe we are better off white-listing known-good drivers with a new flag? Sounds reasonable to me if there is no such a flag to be easily used. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote: +base = (unsigned long)from-iov_base + offset1; +size = ((base ~PAGE_MASK) + len + ~PAGE_MASK) PAGE_SHIFT; +num_pages = get_user_pages_fast(base, size, 0,page[i]); +if ((num_pages != size) || +(num_pages MAX_SKB_FRAGS - skb_shinfo(skb)-nr_frags)) +/* put_page is in skb free */ +return -EFAULT; What keeps the user from writing to these pages in it's address space after the write call returns? A write() return of success means: I wrote what you gave to me not I wrote what you gave to me, oh and BTW don't touch these pages for a while. In fact a while isn't even defined in any way, as there is no way for the write() invoker to know when the networking card is done with those pages. That's what io_submit() is for. Then io_getevents() tells you what a while actually was. This macvtap zero copy uses iov buffers from vhost ring, which is allocated from guest kernel. In host kernel, vhost calls macvtap sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers' pages for zero copy. The patch is relying on how vhost handle these buffers. I need to look at vhost code (qemu) first for addressing the questions here. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tuesday 14 September 2010, Shirley Ma wrote: On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote: That's what io_submit() is for. Then io_getevents() tells you what a while actually was. This macvtap zero copy uses iov buffers from vhost ring, which is allocated from guest kernel. In host kernel, vhost calls macvtap sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers' pages for zero copy. The patch is relying on how vhost handle these buffers. I need to look at vhost code (qemu) first for addressing the questions here. I guess the best solution would be to make macvtap_aio_write return -EIOCBQUEUED when a packet gets passed down to the adapter, and call aio_complete when the adapter is done with it. This would change the regular behavior of macvtap into a model where every write on the file blocks until the packet has left the machine, which gives us better flow control, but does slow down the traffic when we only put one packet at a time into the queue. It also allows the user to call io_submit instead of write in order to do an asynchronous submission as Avi was suggesting. Arnd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote: I would expect this to hurt performance significantly. We could do this for asynchronous requests only to avoid the slowdown. Is kiocb in sendmsg helpful here? It is not used now. Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote: I would expect this to hurt performance significantly. We could do this for asynchronous requests only to avoid the slowdown. Is kiocb in sendmsg helpful here? It is not used now. Shirley Precisely. This is what the patch from Xin Xiaohui does. That code already seems to do most of what you are trying to do, right? The main thing missing seems to be macvtap integration, so that we can fall back on data copy if zero copy is unavailable? How hard would it be to basically link the mp and macvtap modules together to get us this functionality? Anyone? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, 2010-09-14 at 18:29 +0200, Michael S. Tsirkin wrote: Precisely. This is what the patch from Xin Xiaohui does. That code already seems to do most of what you are trying to do, right? I thought host pins guest kernel buffer pages was good enough for TX thought I didn't look up xiaohui's vhost asycn io patch in details. What's the performance data Xiaohui got from using kiocb? I haven't seen any performance number from him yet. The main thing missing seems to be macvtap integration, so that we can fall back on data copy if zero copy is unavailable? How hard would it be to basically link the mp and macvtap modules together to get us this functionality? Anyone? The simple integration is using macvtap + xiaohui's vhost asycn io patch. I can make a try for TX only. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, Sep 14, 2010 at 10:02:25AM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 18:29 +0200, Michael S. Tsirkin wrote: Precisely. This is what the patch from Xin Xiaohui does. That code already seems to do most of what you are trying to do, right? I thought host pins guest kernel buffer pages was good enough for TX thought I didn't look up xiaohui's vhost asycn io patch in details. As others said, the harder issues for TX are in determining that it's safe to unpin the memory, and how much memory is it safe to pin to beging with. For RX we have some more complexity. What's the performance data Xiaohui got from using kiocb? I haven't seen any performance number from him yet. The main thing missing seems to be macvtap integration, so that we can fall back on data copy if zero copy is unavailable? How hard would it be to basically link the mp and macvtap modules together to get us this functionality? Anyone? The simple integration is using macvtap + xiaohui's vhost asycn io patch. I can make a try for TX only. Thanks Shirley Well it's up to you of course, but this is not what I would try: if you look at the patchset vhost changes is not the largest part of it, so this sounds a bit like effort duplication. TX only is also much less interesting than full zero copy. I think that you should be able to simply combine the two drivers together, add an ioctl to enable/disable zero copy mode of operation. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote: As others said, the harder issues for TX are in determining that it's safe to unpin the memory, and how much memory is it safe to pin to beging with. For RX we have some more complexity. I think unpin the memory is in kfree_skb() whenever the last reference is gone for TX. What we discussed about here is when/how vhost get notified to update ring buffer descriptors. Do I misunderstand something here? Well it's up to you of course, but this is not what I would try: if you look at the patchset vhost changes is not the largest part of it, so this sounds a bit like effort duplication. TX only is also much less interesting than full zero copy. It's not true. From the performance, TX only has gained big improvement. We need to identify how much gain from TX zero copy, and how much gain from RX zero copy. I think that you should be able to simply combine the two drivers together, add an ioctl to enable/disable zero copy mode of operation. That could work. But what's the purpose to have two drivers if one driver can handle it? Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, Sep 14, 2010 at 11:49:03AM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote: As others said, the harder issues for TX are in determining that it's safe to unpin the memory, and how much memory is it safe to pin to beging with. For RX we have some more complexity. I think unpin the memory is in kfree_skb() whenever the last reference is gone for TX. What we discussed about here is when/how vhost get notified to update ring buffer descriptors. Do I misunderstand something here? Right, that's a better way to put it. Well it's up to you of course, but this is not what I would try: if you look at the patchset vhost changes is not the largest part of it, so this sounds a bit like effort duplication. TX only is also much less interesting than full zero copy. It's not true. From the performance, TX only has gained big improvement. We need to identify how much gain from TX zero copy, and how much gain from RX zero copy. I was speaking from the code point of view: since we'll want both TX and RX eventually it's nice to see that some thought was given to RX even if we only merge TX as a first step. From the product POV, RX is already slower (more interrupts, etc) than TX so speeding it up might be more important, but I agree, every bit helps. I think that you should be able to simply combine the two drivers together, add an ioctl to enable/disable zero copy mode of operation. That could work. But what's the purpose to have two drivers if one driver can handle it? Thanks Shirley This was just an idea: I thought it's a good way for people interested in this zero copy thing to combine forces and avoid making the same mistakes, but it's not a must of course. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote: On Tue, Sep 14, 2010 at 11:49:03AM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote: As others said, the harder issues for TX are in determining that it's safe to unpin the memory, and how much memory is it safe to pin to beging with. For RX we have some more complexity. I think unpin the memory is in kfree_skb() whenever the last reference is gone for TX. What we discussed about here is when/how vhost get notified to update ring buffer descriptors. Do I misunderstand something here? Right, that's a better way to put it. That's how this macvtap patch did. For how much pinned pages,it is limited by sk_wmem_alloc size in this patch. thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote: I think that you should be able to simply combine the two drivers together, add an ioctl to enable/disable zero copy mode of operation. That could work. But what's the purpose to have two drivers if one driver can handle it? Thanks Shirley This was just an idea: I thought it's a good way for people interested in this zero copy thing to combine forces and avoid making the same mistakes, but it's not a must of course. Ok, I will make a simple patch by reusing Xiaohui's some vhost code on handling vhost_add_used_and_signal() to see any performance changes. The interesting thing here when I run 32 instances netperf/netserver I didn't see any issue w/i this patch. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Shirley Ma [mailto:mashi...@us.ibm.com] Sent: Tuesday, September 14, 2010 11:05 PM To: Avi Kivity Cc: David Miller; a...@arndb.de; m...@redhat.com; Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote: +base = (unsigned long)from-iov_base + offset1; +size = ((base ~PAGE_MASK) + len + ~PAGE_MASK) PAGE_SHIFT; +num_pages = get_user_pages_fast(base, size, 0,page[i]); +if ((num_pages != size) || +(num_pages MAX_SKB_FRAGS - skb_shinfo(skb)-nr_frags)) +/* put_page is in skb free */ +return -EFAULT; What keeps the user from writing to these pages in it's address space after the write call returns? A write() return of success means: I wrote what you gave to me not I wrote what you gave to me, oh and BTW don't touch these pages for a while. In fact a while isn't even defined in any way, as there is no way for the write() invoker to know when the networking card is done with those pages. That's what io_submit() is for. Then io_getevents() tells you what a while actually was. This macvtap zero copy uses iov buffers from vhost ring, which is allocated from guest kernel. In host kernel, vhost calls macvtap sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers' pages for zero copy. The patch is relying on how vhost handle these buffers. I need to look at vhost code (qemu) first for addressing the questions here. Thanks Shirley I think what David said is what we have thought before in mp device. Since we are not sure the exact time the tx buffer was wrote though DMA operation. But the deadline is when the tx buffer was freed. So we only notify the vhost stuff about the write when tx buffer freed. But the deadline is maybe too late for performance. Thanks Xiaohui -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Arnd Bergmann [mailto:a...@arndb.de] Sent: Tuesday, September 14, 2010 11:21 PM To: Shirley Ma Cc: Avi Kivity; David Miller; m...@redhat.com; Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Tuesday 14 September 2010, Shirley Ma wrote: On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote: That's what io_submit() is for. Then io_getevents() tells you what a while actually was. This macvtap zero copy uses iov buffers from vhost ring, which is allocated from guest kernel. In host kernel, vhost calls macvtap sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers' pages for zero copy. The patch is relying on how vhost handle these buffers. I need to look at vhost code (qemu) first for addressing the questions here. I guess the best solution would be to make macvtap_aio_write return -EIOCBQUEUED when a packet gets passed down to the adapter, and call aio_complete when the adapter is done with it. This would change the regular behavior of macvtap into a model where every write on the file blocks until the packet has left the machine, which gives us better flow control, but does slow down the traffic when we only put one packet at a time into the queue. It also allows the user to call io_submit instead of write in order to do an asynchronous submission as Avi was suggesting. But currently, this patch is communicated with vhost-net, which is almost in the kernel side. If it uses aio stuff, it should be communicate with user space Backend. Arnd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Wed, 2010-09-15 at 09:50 +0800, Xin, Xiaohui wrote: I think what David said is what we have thought before in mp device. Since we are not sure the exact time the tx buffer was wrote though DMA operation. But the deadline is when the tx buffer was freed. So we only notify the vhost stuff about the write when tx buffer freed. But the deadline is maybe too late for performance. Have you tried it? If so what's the performance penalty you have seen by notifying vhost when tx buffer freed? I am thinking to have a callback in skb destructor, vhost_add_used_and_signal gets updated when skb is actually freed, vhost vq head need to be passed to the callback. This might requires vhost ring size is at least as big as the lower device driver. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Wednesday, September 15, 2010 12:30 AM To: Shirley Ma Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote: I would expect this to hurt performance significantly. We could do this for asynchronous requests only to avoid the slowdown. Is kiocb in sendmsg helpful here? It is not used now. Shirley Precisely. This is what the patch from Xin Xiaohui does. That code already seems to do most of what you are trying to do, right? The main thing missing seems to be macvtap integration, so that we can fall back on data copy if zero copy is unavailable? How hard would it be to basically link the mp and macvtap modules together to get us this functionality? Anyone? Michael, Is to support macvtap with zero-copy through mp device the functionality you mentioned above? Before Shirley Ma has suggested to move the zero-copy functionality into tun/tap device or macvtap device. How do you think about that? I suspect there will be a lot of duplicate code in that three drivers except we can extract code of zero-copy into kernel APIs and vhost APIs. Do you think that's worth to do and help current process which is blocked too long than I expected? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Shirley Ma [mailto:mashi...@us.ibm.com] Sent: Wednesday, September 15, 2010 10:41 AM To: Xin, Xiaohui Cc: Avi Kivity; David Miller; a...@arndb.de; m...@redhat.com; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org Subject: RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel On Wed, 2010-09-15 at 09:50 +0800, Xin, Xiaohui wrote: I think what David said is what we have thought before in mp device. Since we are not sure the exact time the tx buffer was wrote though DMA operation. But the deadline is when the tx buffer was freed. So we only notify the vhost stuff about the write when tx buffer freed. But the deadline is maybe too late for performance. Have you tried it? If so what's the performance penalty you have seen by notifying vhost when tx buffer freed? We did not try it before, as we cared RX side more. I am thinking to have a callback in skb destructor, vhost_add_used_and_signal gets updated when skb is actually freed, vhost vq head need to be passed to the callback. This might requires vhost ring size is at least as big as the lower device driver. That's almost the same what we have done except we use destructor_arg and another callback.. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, Sep 14, 2010 at 12:36:23PM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote: I think that you should be able to simply combine the two drivers together, add an ioctl to enable/disable zero copy mode of operation. That could work. But what's the purpose to have two drivers if one driver can handle it? Thanks Shirley This was just an idea: I thought it's a good way for people interested in this zero copy thing to combine forces and avoid making the same mistakes, but it's not a must of course. Ok, I will make a simple patch by reusing Xiaohui's some vhost code on handling vhost_add_used_and_signal() to see any performance changes. The interesting thing here when I run 32 instances netperf/netserver I didn't see any issue w/i this patch. Thanks Shirley Yes, I agree this patch is useful for demo purposes: simple, and shows what kind of performance gains we can expect for TX. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, Sep 14, 2010 at 07:40:52PM -0700, Shirley Ma wrote: On Wed, 2010-09-15 at 09:50 +0800, Xin, Xiaohui wrote: I think what David said is what we have thought before in mp device. Since we are not sure the exact time the tx buffer was wrote though DMA operation. But the deadline is when the tx buffer was freed. So we only notify the vhost stuff about the write when tx buffer freed. But the deadline is maybe too late for performance. Have you tried it? If so what's the performance penalty you have seen by notifying vhost when tx buffer freed? I am thinking to have a callback in skb destructor, vhost_add_used_and_signal gets updated when skb is actually freed, vhost vq head need to be passed to the callback. This might requires vhost ring size is at least as big as the lower device driver. Thanks Shirley For some of the issues, try following the discussion around net: af_packet: don't call tpacket_destruct_skb() until the skb is sent out. Summary: it's difficult to do correctly generally. Limiting ourselves to transmit on specific devices might make it possible. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tue, Sep 14, 2010 at 12:20:29PM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote: On Tue, Sep 14, 2010 at 11:49:03AM -0700, Shirley Ma wrote: On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote: As others said, the harder issues for TX are in determining that it's safe to unpin the memory, and how much memory is it safe to pin to beging with. For RX we have some more complexity. I think unpin the memory is in kfree_skb() whenever the last reference is gone for TX. What we discussed about here is when/how vhost get notified to update ring buffer descriptors. Do I misunderstand something here? Right, that's a better way to put it. That's how this macvtap patch did. For how much pinned pages,it is limited by sk_wmem_alloc size in this patch. thanks Shirley Except that you seem to pin full pages but account sub-page size in wmem. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
Add zero copy feature between userspace and kernel in macvtap when lower device supports high memory DMA. Signed-off-by: Shirley Ma x...@us.ibm.com --- drivers/net/macvtap.c | 136 + 1 files changed, 126 insertions(+), 10 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 3b1c54a..186cde1 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -274,6 +274,7 @@ static int macvtap_open(struct inode *inode, struct file *file) struct net *net = current-nsproxy-net_ns; struct net_device *dev = dev_get_by_index(net, iminor(inode)); struct macvtap_queue *q; + struct macvlan_dev *vlan = netdev_priv(dev); int err; err = -ENODEV; @@ -302,6 +303,17 @@ static int macvtap_open(struct inode *inode, struct file *file) q-flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP; q-vnet_hdr_sz = sizeof(struct virtio_net_hdr); + /* +* so far only VM uses macvtap, enable zero copy between guest +* kernel and host kernel when lower device supports high memory +* DMA +*/ + if (vlan) { + if ((vlan-lowerdev-features NETIF_F_HIGHDMA) + (vlan-lowerdev-features NETIF_F_SG)) + sock_set_flag(q-sk, SOCK_ZEROCOPY); + } + err = macvtap_set_queue(dev, file, q); if (err) sock_put(q-sk); @@ -343,6 +355,24 @@ out: return mask; } +#define GOODCOPY_LEN (L1_CACHE_BYTES 64 ? 64 : L1_CACHE_BYTES) + +static inline struct sk_buff *macvtap_alloc_skb_goodcopy(struct sock *sk, +size_t prepad, size_t copy, +int noblock, int *err) +{ + struct sk_buff *skb; + + skb = sock_alloc_send_pskb(sk, prepad + copy, 0, noblock, err); + if (!skb) + return NULL; + skb_reserve(skb, prepad); + skb_put(skb, copy); + + return skb; + +} + static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad, size_t len, size_t linear, int noblock, int *err) @@ -447,15 +477,91 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb, return 0; } +/* set skb frags from iovec, this can move to core network code for reuse */ +static int set_sg_from_iovec_zerocopy(struct sk_buff *skb, + const struct iovec *from, int offset, + size_t count) +{ + int len = iov_length(from, count) - offset; + int copy = skb_headlen(skb); + int size, offset1 = 0; + int i = 0; + skb_frag_t *f; + + /* Skip over from offset */ + while (offset = from-iov_len) { + offset -= from-iov_len; + ++from; + --count; + } + + /* copy up to skb headlen */ + while (copy 0) { + size = min_t(unsigned int, copy, from-iov_len - offset); + if (copy_from_user(skb-data + offset1, from-iov_base + offset, + size)) + return -EFAULT; + if (copy size) { + ++from; + --count; + } + copy -= size; + offset1 += size; + offset = 0; + } + + if (len == offset1) + return 0; + + while (count--) { + struct page *page[MAX_SKB_FRAGS]; + int num_pages; + unsigned long base; + + len = from-iov_len - offset1; + if (!len) { + offset1 = 0; + ++from; + continue; + } + base = (unsigned long)from-iov_base + offset1; + size = ((base ~PAGE_MASK) + len + ~PAGE_MASK) PAGE_SHIFT; + num_pages = get_user_pages_fast(base, size, 0, page[i]); + if ((num_pages != size) || + (num_pages MAX_SKB_FRAGS - skb_shinfo(skb)-nr_frags)) + /* put_page is in skb free */ + return -EFAULT; + while (len) { + f = skb_shinfo(skb)-frags[i]; + f-page = page[i]; + f-page_offset = base ~PAGE_MASK; + f-size = min_t(int, len, PAGE_SIZE - f-page_offset); + skb-data_len += f-size; + skb-len += f-size; + skb-truesize += f-size; + skb_shinfo(skb)-nr_frags++; + /* increase sk_wmem_alloc */ + if (skb-sk skb-destructor == sock_wfree) + atomic_add(f-size, skb-sk-sk_wmem_alloc); +
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Shirley Ma mashi...@us.ibm.com Date: Mon, 13 Sep 2010 13:48:03 -0700 + base = (unsigned long)from-iov_base + offset1; + size = ((base ~PAGE_MASK) + len + ~PAGE_MASK) PAGE_SHIFT; + num_pages = get_user_pages_fast(base, size, 0, page[i]); + if ((num_pages != size) || + (num_pages MAX_SKB_FRAGS - skb_shinfo(skb)-nr_frags)) + /* put_page is in skb free */ + return -EFAULT; What keeps the user from writing to these pages in it's address space after the write call returns? A write() return of success means: I wrote what you gave to me not I wrote what you gave to me, oh and BTW don't touch these pages for a while. In fact a while isn't even defined in any way, as there is no way for the write() invoker to know when the networking card is done with those pages. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html