Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-29 Thread Michael S. Tsirkin
On Tue, Sep 28, 2010 at 08:24:29PM -0700, Shirley Ma wrote:
 Hello Michael,
 
 On Wed, 2010-09-15 at 07:52 -0700, Shirley Ma wrote:
 Don't you think once I address vhost_add_used_and_signal update
issue, it is a simple and complete patch for macvtap TX zero copy?

Thanks
Shirley
   
   I like the fact that the patch is simple. Unfortunately
   I suspect it'll stop being simple by the time it's complete :) 
  
  I can make a try. :)
 
 I compared several approaches for addressing the issue being raised here
 on how/when to update vhost_add_used_and_signal. The simple approach I
 have found is:
 
 1. Adding completion field in struct virtqueue;
 2. when it is a zero copy packet, put vhost thread wait for completion
 to update vhost_add_used_and_signal;
 3. passing vq from vhost to macvtap as skb destruct_arg;
 4. when skb is freed for the last reference, signal vq completion
 The test results show same performance as the original patch. How do you
 think? If it sounds good to you. I will resubmit this reversion patch.
 The patch still keeps as simple as it was before. :)
 
 Thanks
 Shirley

If you look at dev_hard_start_xmit you will see a call
to skb_orphan_try which often calls the skb destructor.
So I suspect this is almost equivalent to your original patch,
and has the same correctness issue.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-29 Thread Michael S. Tsirkin
On Wed, Sep 29, 2010 at 10:16:45AM +0200, Michael S. Tsirkin wrote:
 On Tue, Sep 28, 2010 at 08:24:29PM -0700, Shirley Ma wrote:
  Hello Michael,
  
  On Wed, 2010-09-15 at 07:52 -0700, Shirley Ma wrote:
  Don't you think once I address vhost_add_used_and_signal update
 issue, it is a simple and complete patch for macvtap TX zero copy?
 
 Thanks
 Shirley

I like the fact that the patch is simple. Unfortunately
I suspect it'll stop being simple by the time it's complete :) 
   
   I can make a try. :)
  
  I compared several approaches for addressing the issue being raised here
  on how/when to update vhost_add_used_and_signal. The simple approach I
  have found is:
  
  1. Adding completion field in struct virtqueue;
  2. when it is a zero copy packet, put vhost thread wait for completion
  to update vhost_add_used_and_signal;
  3. passing vq from vhost to macvtap as skb destruct_arg;
  4. when skb is freed for the last reference, signal vq completion
  The test results show same performance as the original patch. How do you
  think? If it sounds good to you. I will resubmit this reversion patch.
  The patch still keeps as simple as it was before. :)
  
  Thanks
  Shirley
 
 If you look at dev_hard_start_xmit you will see a call
 to skb_orphan_try which often calls the skb destructor.
 So I suspect this is almost equivalent to your original patch,
 and has the same correctness issue.

So you could try doing skb_tx(skb)-prevent_sk_orphan = 1
just to see what will happen. Might be interesting - just
make sure the device doesn't orphan the skb first thing.
I suspect lack of parallelism will result in bad throughput
esp for small messages.

Note this still won't make it correct (this has module unloading
issue, and devices might still orphan skb, clone it, or hang on to
paged data in some other way) but at least closer.

I think you should try testing with guest to external communication,
this will uncover some of these correctness issues for you.
I think netperf also has some flag to check data, might
be a good idea to use it for testing.

 -- 
 MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-29 Thread Shirley Ma
On Wed, 2010-09-29 at 10:16 +0200, Michael S. Tsirkin wrote:
 If you look at dev_hard_start_xmit you will see a call
 to skb_orphan_try which often calls the skb destructor.
 So I suspect this is almost equivalent to your original patch,
 and has the same correctness issue.

I forgot to mention, in skb_orphan_try, I still kept the skb-sk
reference to sk, I assign skb-sk to NULL when the DMA done.

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-29 Thread Shirley Ma
On Wed, 2010-09-29 at 10:28 +0200, Michael S. Tsirkin wrote:
 I think you should try testing with guest to external communication,
 this will uncover some of these correctness issues for you.
 I think netperf also has some flag to check data, might
 be a good idea to use it for testing.

I always tested guest to remote host, remote host to guest. What other
external communication do you suggest here?

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-29 Thread Shirley Ma
On Wed, 2010-09-29 at 10:16 +0200, Michael S. Tsirkin wrote:
  I compared several approaches for addressing the issue being raised
 here
  on how/when to update vhost_add_used_and_signal. The simple approach
 I
  have found is:
  
  1. Adding completion field in struct virtqueue;
  2. when it is a zero copy packet, put vhost thread wait for
 completion
  to update vhost_add_used_and_signal;
  3. passing vq from vhost to macvtap as skb destruct_arg;
  4. when skb is freed for the last reference, signal vq completion
  The test results show same performance as the original patch. How do
 you
  think? If it sounds good to you. I will resubmit this reversion
 patch.
  The patch still keeps as simple as it was before. :)
  
  Thanks
  Shirley
 
 If you look at dev_hard_start_xmit you will see a call
 to skb_orphan_try which often calls the skb destructor.
 So I suspect this is almost equivalent to your original patch,
 and has the same correctness issue. 

If I didn't address this, then vhost net will wait for ever with wait
for completion :))


Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-29 Thread Shirley Ma
On Wed, 2010-09-29 at 10:28 +0200, Michael S. Tsirkin wrote:
   1. Adding completion field in struct virtqueue;
   2. when it is a zero copy packet, put vhost thread wait for
 completion
   to update vhost_add_used_and_signal;
   3. passing vq from vhost to macvtap as skb destruct_arg;
   4. when skb is freed for the last reference, signal vq completion
   The test results show same performance as the original patch. How
 do you
   think? If it sounds good to you. I will resubmit this reversion
 patch.
   The patch still keeps as simple as it was before. :)
   
   Thanks
   Shirley
  
  If you look at dev_hard_start_xmit you will see a call
  to skb_orphan_try which often calls the skb destructor.
  So I suspect this is almost equivalent to your original patch,
  and has the same correctness issue.
 
 So you could try doing skb_tx(skb)-prevent_sk_orphan = 1
 just to see what will happen. Might be interesting - just
 make sure the device doesn't orphan the skb first thing.
 I suspect lack of parallelism will result in bad throughput
 esp for small messages.
 
 Note this still won't make it correct (this has module unloading
 issue, and devices might still orphan skb, clone it, or hang on to
 paged data in some other way) but at least closer. 

For message size smaller than 128, it still does copy. I tested some
small message size, I didn't see any regression. I will run more test to
focus on small message size between 128 - 4K.

I don't need prevent_sk_orphan since in skb_release_data for last
reference, I just need the ZEROCOPY flag from that sock to signal a
completion.

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-29 Thread Michael S. Tsirkin
On Tue, Sep 28, 2010 at 08:24:29PM -0700, Shirley Ma wrote:
 Hello Michael,
 
 On Wed, 2010-09-15 at 07:52 -0700, Shirley Ma wrote:
 Don't you think once I address vhost_add_used_and_signal update
issue, it is a simple and complete patch for macvtap TX zero copy?

Thanks
Shirley
   
   I like the fact that the patch is simple. Unfortunately
   I suspect it'll stop being simple by the time it's complete :) 
  
  I can make a try. :)
 
 I compared several approaches for addressing the issue being raised here
 on how/when to update vhost_add_used_and_signal. The simple approach I
 have found is:
 
 1. Adding completion field in struct virtqueue;
 2. when it is a zero copy packet, put vhost thread wait for completion
 to update vhost_add_used_and_signal;
 3. passing vq from vhost to macvtap as skb destruct_arg;
 4. when skb is freed for the last reference, signal vq completion
 
 The test results show same performance as the original patch. How do you
 think? If it sounds good to you. I will resubmit this reversion patch.
 The patch still keeps as simple as it was before. :)
 
 Thanks
 Shirley

I guess I just don't understand what your patch does.
If you send it, I can take a look.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-29 Thread Shirley Ma
On Wed, 2010-09-29 at 17:14 +0200, Michael S. Tsirkin wrote:
 I guess I just don't understand what your patch does.
 If you send it, I can take a look.

Ok, I will collect more performance data and send the patch.

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-28 Thread Shirley Ma
Hello Michael,

On Wed, 2010-09-15 at 07:52 -0700, Shirley Ma wrote:
Don't you think once I address vhost_add_used_and_signal update
   issue, it is a simple and complete patch for macvtap TX zero copy?
   
   Thanks
   Shirley
  
  I like the fact that the patch is simple. Unfortunately
  I suspect it'll stop being simple by the time it's complete :) 
 
 I can make a try. :)

I compared several approaches for addressing the issue being raised here
on how/when to update vhost_add_used_and_signal. The simple approach I
have found is:

1. Adding completion field in struct virtqueue;
2. when it is a zero copy packet, put vhost thread wait for completion
to update vhost_add_used_and_signal;
3. passing vq from vhost to macvtap as skb destruct_arg;
4. when skb is freed for the last reference, signal vq completion

The test results show same performance as the original patch. How do you
think? If it sounds good to you. I will resubmit this reversion patch.
The patch still keeps as simple as it was before. :)

Thanks
Shirley


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-16 Thread Xin, Xiaohui
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Wednesday, September 15, 2010 5:59 PM
To: Xin, Xiaohui
Cc: Shirley Ma; Arnd Bergmann; Avi Kivity; David Miller; 
net...@vger.kernel.org;
kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
kernel

On Wed, Sep 15, 2010 at 10:46:02AM +0800, Xin, Xiaohui wrote:
 From: Michael S. Tsirkin [mailto:m...@redhat.com]
 Sent: Wednesday, September 15, 2010 12:30 AM
 To: Shirley Ma
 Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; 
 net...@vger.kernel.org;
 kvm@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
 kernel
 
 On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote:
  On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote:
   I would expect this to hurt performance significantly.
   We could do this for asynchronous requests only to avoid the
   slowdown.
 
  Is kiocb in sendmsg helpful here? It is not used now.
 
  Shirley
 
 Precisely. This is what the patch from Xin Xiaohui does.  That code
 already seems to do most of what you are trying to do, right?
 
 The main thing missing seems to be macvtap integration, so that we can fall 
 back
 on data copy if zero copy is unavailable?
 How hard would it be to basically link the mp and macvtap modules
 together to get us this functionality? Anyone?
 
 Michael,
 Is to support macvtap with zero-copy through mp device the functionality
 you mentioned above?

I have trouble parsing the above question.  At some point Arnd suggested
that the mp device functionality would fit nicely as part of the macvtap
driver.  It seems to make sense superficially, the advantage if it
worked would be that we would get zero copy (mostly) transparently.

Do you agree with this goal?


I would say yes.

 Before Shirley Ma has suggested to move the zero-copy functionality into
 tun/tap device or macvtap device. How do you think about that?
 I suspect
 there will be a lot of duplicate code in that three drivers except we can 
 extract
 code of zero-copy into kernel APIs and vhost APIs.


tap would be very hard at this point as it does not bind to a device.
macvtap might work, we mainly need to figure out a way to detect that
device can do zero copy so the right mode is used.  I think a first step
could be to simply link mp code into macvtap module, pass necessary
ioctls on, then move some code around as necessary.  This might get rid
of code duplication nicely.

I'll look into this to see how much effort would be.



 Do you think that's worth to do and help current process which is blocked too
 long than I expected?

I think it's nice to have.

And if done hopefully this will get the folk working on the macvtap
driver to review the code, which will help find all issues faster.

I think if you post some performance numbers,
this will also help get people excited and looking at the code.


The performance data I have posted before is compared with raw socket on 
vhost-net.
But currently, the raw socket backend is removed from the qemu side.
So I can only compare with tap on vhost-net. But unfortunately, I missed 
something
that I even can't bring it up. I was blocked by this for a time.

I also don't see the process as completely blocked, each review round points
out more issues: we aren't going back and forth changing
same lines again and again, are we?

One thing that might help is increase the frequency of updates,
try sending them out sooner.
On the other hand 10 new patches each revision is a lot:
if there is a part of patchset that has stabilised you can split it out,
post once and keep posting the changing part separately.

I hope these suggestions help.

Thanks, Michael!


 
 --
 MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-16 Thread Michael S. Tsirkin
On Thu, Sep 16, 2010 at 04:18:10PM +0800, Xin, Xiaohui wrote:
 From: Michael S. Tsirkin [mailto:m...@redhat.com]
 Sent: Wednesday, September 15, 2010 5:59 PM
 To: Xin, Xiaohui
 Cc: Shirley Ma; Arnd Bergmann; Avi Kivity; David Miller; 
 net...@vger.kernel.org;
 kvm@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
 kernel
 
 On Wed, Sep 15, 2010 at 10:46:02AM +0800, Xin, Xiaohui wrote:
  From: Michael S. Tsirkin [mailto:m...@redhat.com]
  Sent: Wednesday, September 15, 2010 12:30 AM
  To: Shirley Ma
  Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; 
  net...@vger.kernel.org;
  kvm@vger.kernel.org; linux-ker...@vger.kernel.org
  Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
  kernel
  
  On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote:
   On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote:
I would expect this to hurt performance significantly.
We could do this for asynchronous requests only to avoid the
slowdown.
  
   Is kiocb in sendmsg helpful here? It is not used now.
  
   Shirley
  
  Precisely. This is what the patch from Xin Xiaohui does.  That code
  already seems to do most of what you are trying to do, right?
  
  The main thing missing seems to be macvtap integration, so that we can 
  fall back
  on data copy if zero copy is unavailable?
  How hard would it be to basically link the mp and macvtap modules
  together to get us this functionality? Anyone?
  
  Michael,
  Is to support macvtap with zero-copy through mp device the functionality
  you mentioned above?
 
 I have trouble parsing the above question.  At some point Arnd suggested
 that the mp device functionality would fit nicely as part of the macvtap
 driver.  It seems to make sense superficially, the advantage if it
 worked would be that we would get zero copy (mostly) transparently.
 
 Do you agree with this goal?
 
 
 I would say yes.

In that case, it's a blocker for upstream merge because this change
affects userspace.

  Before Shirley Ma has suggested to move the zero-copy functionality into
  tun/tap device or macvtap device. How do you think about that?
  I suspect
  there will be a lot of duplicate code in that three drivers except we can 
  extract
  code of zero-copy into kernel APIs and vhost APIs.
 
 
 tap would be very hard at this point as it does not bind to a device.
 macvtap might work, we mainly need to figure out a way to detect that
 device can do zero copy so the right mode is used.  I think a first step
 could be to simply link mp code into macvtap module, pass necessary
 ioctls on, then move some code around as necessary.  This might get rid
 of code duplication nicely.
 
 I'll look into this to see how much effort would be.
 
 
 
  Do you think that's worth to do and help current process which is blocked 
  too
  long than I expected?
 
 I think it's nice to have.
 
 And if done hopefully this will get the folk working on the macvtap
 driver to review the code, which will help find all issues faster.
 
 I think if you post some performance numbers,
 this will also help get people excited and looking at the code.
 
 
 The performance data I have posted before is compared with raw socket on 
 vhost-net.
 But currently, the raw socket backend is removed from the qemu side.
 So I can only compare with tap on vhost-net. But unfortunately, I missed 
 something
 that I even can't bring it up. I was blocked by this for a time.

Hey, maybe you are seeing the bug that was reported recently.
Could you try tcpdump -i on the tap interface in host and ethX on guest
and tell me what you see?
If you see packet in guest but not in host, could you try
adding printks in vhost handle_tx to see whether it gets called
and if yes where it fails?

 I also don't see the process as completely blocked, each review round points
 out more issues: we aren't going back and forth changing
 same lines again and again, are we?
 
 One thing that might help is increase the frequency of updates,
 try sending them out sooner.
 On the other hand 10 new patches each revision is a lot:
 if there is a part of patchset that has stabilised you can split it out,
 post once and keep posting the changing part separately.
 
 I hope these suggestions help.
 
 Thanks, Michael!
 
 
  
  --
  MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-15 Thread Shirley Ma
On Wed, 2010-09-15 at 07:27 +0200, Michael S. Tsirkin wrote:
 For some of the issues, try following the discussion around
 net: af_packet: don't call tpacket_destruct_skb() until the skb is
 sent
 out.
 
 Summary: it's difficult to do correctly generally. Limiting ourselves
 to transmit on specific devices might make it possible.

Thanks for the tips.

Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-15 Thread Shirley Ma
On Wed, 2010-09-15 at 07:12 +0200, Michael S. Tsirkin wrote:
 Yes, I agree this patch is useful for demo purposes:
 simple, and shows what kind of performance gains
 we can expect for TX. 

Any other issue you can see in this patch beside vhost descriptors
update?  Don't you think once I address vhost_add_used_and_signal update
issue, it is a simple and complete patch for macvtap TX zero copy?

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-15 Thread Michael S. Tsirkin
On Wed, Sep 15, 2010 at 10:46:02AM +0800, Xin, Xiaohui wrote:
 From: Michael S. Tsirkin [mailto:m...@redhat.com]
 Sent: Wednesday, September 15, 2010 12:30 AM
 To: Shirley Ma
 Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; 
 net...@vger.kernel.org;
 kvm@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
 kernel
 
 On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote:
  On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote:
   I would expect this to hurt performance significantly.
   We could do this for asynchronous requests only to avoid the
   slowdown.
 
  Is kiocb in sendmsg helpful here? It is not used now.
 
  Shirley
 
 Precisely. This is what the patch from Xin Xiaohui does.  That code
 already seems to do most of what you are trying to do, right?
 
 The main thing missing seems to be macvtap integration, so that we can fall 
 back
 on data copy if zero copy is unavailable?
 How hard would it be to basically link the mp and macvtap modules
 together to get us this functionality? Anyone?
 
 Michael,
 Is to support macvtap with zero-copy through mp device the functionality
 you mentioned above?

I have trouble parsing the above question.  At some point Arnd suggested
that the mp device functionality would fit nicely as part of the macvtap
driver.  It seems to make sense superficially, the advantage if it
worked would be that we would get zero copy (mostly) transparently.

Do you agree with this goal?

 Before Shirley Ma has suggested to move the zero-copy functionality into
 tun/tap device or macvtap device. How do you think about that?
 I suspect
 there will be a lot of duplicate code in that three drivers except we can 
 extract
 code of zero-copy into kernel APIs and vhost APIs.


tap would be very hard at this point as it does not bind to a device.
macvtap might work, we mainly need to figure out a way to detect that
device can do zero copy so the right mode is used.  I think a first step
could be to simply link mp code into macvtap module, pass necessary
ioctls on, then move some code around as necessary.  This might get rid
of code duplication nicely.


 Do you think that's worth to do and help current process which is blocked too
 long than I expected?

I think it's nice to have.

And if done hopefully this will get the folk working on the macvtap
driver to review the code, which will help find all issues faster.

I think if you post some performance numbers,
this will also help get people excited and looking at the code.

I also don't see the process as completely blocked, each review round points
out more issues: we aren't going back and forth changing
same lines again and again, are we?

One thing that might help is increase the frequency of updates,
try sending them out sooner.
On the other hand 10 new patches each revision is a lot:
if there is a part of patchset that has stabilised you can split it out,
post once and keep posting the changing part separately.

I hope these suggestions help.

 
 --
 MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-15 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 11:21:15PM -0700, Shirley Ma wrote:
 On Wed, 2010-09-15 at 07:12 +0200, Michael S. Tsirkin wrote:
  Yes, I agree this patch is useful for demo purposes:
  simple, and shows what kind of performance gains
  we can expect for TX. 
 
 Any other issue you can see in this patch beside vhost descriptors
 update?

Another issue is that macvtap can be bound to almost
anything, including e.g. a tap device or a bridge,
which might hang on to skb fragments for unlimited time.
Zero copy TX won't easily work there.
I can imagine either somehow triggering a data copy after the
fact (hard), or detecting such devices and avoiding
zero copy (unfortunate for guest to guest, and drivers
will need tuning).

  Don't you think once I address vhost_add_used_and_signal update
 issue, it is a simple and complete patch for macvtap TX zero copy?
 
 Thanks
 Shirley

I like the fact that the patch is simple. Unfortunately
I suspect it'll stop being simple by the time it's complete :)

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-15 Thread Shirley Ma
On Wed, 2010-09-15 at 12:10 +0200, Michael S. Tsirkin wrote:
 Another issue is that macvtap can be bound to almost
 anything, including e.g. a tap device or a bridge,
 which might hang on to skb fragments for unlimited time.
 Zero copy TX won't easily work there.
 I can imagine either somehow triggering a data copy after the
 fact (hard), or detecting such devices and avoiding
 zero copy (unfortunate for guest to guest, and drivers
 will need tuning).

So far macvtap zero copy patch is limited to lower devices supports high
memory DMA, it doesn't apply to a tap device or a bridge.

   Don't you think once I address vhost_add_used_and_signal update
  issue, it is a simple and complete patch for macvtap TX zero copy?
  
  Thanks
  Shirley
 
 I like the fact that the patch is simple. Unfortunately
 I suspect it'll stop being simple by the time it's complete :) 

I can make a try. :)

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-15 Thread Shirley Ma
On Wed, 2010-09-15 at 17:39 +0200, Michael S. Tsirkin wrote:
 In fact, I rechecked: both bridge and loopback have NETIF_F_HIGHDMA
 set.
 So maybe we should check NETIF_F_NETNS_LOCAL ...
 
 macvtap in bridged mode is interesting as well. 

I found that too, just wondered which flag to use is better. :)

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

From maverick-changes-boun...@lists.ubuntu.com Wed Sep 15 10:00:38 2010
Return-path: maverick-changes-boun...@lists.ubuntu.com
Envelope-to: arch...@mail-archive.com
Delivery-date: Wed, 15 Sep 2010 10:00:38 -0700
Received: from exprod5mx266.postini.com ([64.18.0.89] helo=psmtp.com)
by mail-archive.com with smtp (Exim 4.69)
(envelope-from maverick-changes-boun...@lists.ubuntu.com)
id 1OvvLa-0005Ig-QC
for arch...@mail-archive.com; Wed, 15 Sep 2010 10:00:38 -0700
Received: from source ([91.189.94.204]) by exprod5mx266.postini.com 
([64.18.4.10]) with SMTP;
Wed, 15 Sep 2010 10:00:37 PDT
Received: from localhost ([127.0.0.1] helo=chlorine.canonical.com)
by chlorine.canonical.com with esmtp (Exim 4.69)
(envelope-from maverick-changes-boun...@lists.ubuntu.com)
id 1OvvLT-0006E2-1a; Wed, 15 Sep 2010 18:00:31 +0100
Received: from adelie.canonical.com ([91.189.90.139])
by chlorine.canonical.com with esmtp (Exim 4.69)
(envelope-from boun...@canonical.com) id 1OvvLQ-0006DP-Ih
for maverick-chan...@lists.ubuntu.com; Wed, 15 Sep 2010 18:00:28 +0100
Received: from cocoplum.canonical.com ([91.189.90.15])
by adelie.canonical.com with esmtp (Exim 4.69 #1 (Debian))
id 1OvvLQ-pq-40
for maverick-chan...@lists.ubuntu.com; Wed, 15 Sep 2010 18:00:28 +0100
Received: from cocoplum.canonical.com (localhost [127.0.0.1])
by cocoplum.canonical.com (Postfix) with ESMTP id 195C816C0A6
for maverick-chan...@lists.ubuntu.com;
Wed, 15 Sep 2010 17:00:28 + (UTC)
Content-Type: multipart/mixed; boundary7323435133034647172==
MIME-Version: 1.0
from: Fabrice Coutadeur fabric...@ubuntu.com
subject: [ubuntu/maverick] t38modem 1.2.0-1build1 (Accepted)
to: maverick-chan...@lists.ubuntu.com
X-Launchpad-Component: component=universe, section=comm
X-Katie: Launchpad actually
Message-Id: 20100915170028.11124.85014.launch...@cocoplum.canonical.com
Date: Wed, 15 Sep 2010 17:00:28 -
Precedence: bulk
X-Generated-By: Launchpad (canonical.com); Revision=None;
Instance=initZopeless config overlay
X-Launchpad-Hash: 9370f3bb58447beb8fcb7983a6adb4937501c276
X-BeenThere: maverick-chan...@lists.ubuntu.com
X-Mailman-Version: 2.1.9
Reply-To: Fabrice Coutadeur fabric...@ubuntu.com
List-Id: Maverick Meerkat archive upload notification list
maverick-changes.lists.ubuntu.com
List-Unsubscribe: https://lists.ubuntu.com/mailman/listinfo/maverick-changes, 
mailto:maverick-changes-requ...@lists.ubuntu.com?subject=unsubscribe
List-Archive: https://lists.ubuntu.com/archives/maverick-changes
List-Post: mailto:maverick-chan...@lists.ubuntu.com
List-Help: mailto:maverick-changes-requ...@lists.ubuntu.com?subject=help
List-Subscribe: https://lists.ubuntu.com/mailman/listinfo/maverick-changes, 
mailto:maverick-changes-requ...@lists.ubuntu.com?subject=subscribe
Sender: maverick-changes-boun...@lists.ubuntu.com
Errors-To: maverick-changes-boun...@lists.ubuntu.com
X-pstn-levels: (S:99.9/99.9 CV:99.9000 FC:95.5390 LC:95.5390 
R:95.9108 P:95.9108 M:97.0282 C:98.6951 )
X-pstn-settings: 4 (1.5000:1.5000) s cv gt3 gt2 gt1 r p m c 
X-pstn-addresses: from fabric...@ubuntu.com [294/10] 

--===7323435133034647172==
Content-Type: text/plain; charset=utf-8
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

t38modem (1.2.0-1build1) maverick; urgency=3Dlow

  * No changes upload for libopal3.6.6 - libopal3.6.8 transition

Date: Wed, 15 Sep 2010 08:18:57 +0200
Changed-By: Fabrice Coutadeur fabric...@ubuntu.com
Maintainer: Debian VoIP Team pkg-voip-maintain...@lists.alioth.debian.org
Signed-By: Fabrice Coutadeur coutade...@gmail.com
https://launchpad.net/ubuntu/maverick/+source/t38modem/1.2.0-1build1

--===7323435133034647172==
Content-Type: text/plain; charset=utf-8
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename=changesfile

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Format: 1.8
Date: Wed, 15 Sep 2010 08:18:57 +0200
Source: t38modem
Binary: t38modem
Architecture: source
Version: 1.2.0-1build1
Distribution: maverick
Urgency: low
Maintainer: Debian VoIP Team pkg-voip-maintain...@lists.alioth.debian.org
Changed-By: Fabrice Coutadeur fabric...@ubuntu.com
Description: =

 t38modem   - T.38 Fax over IP pseudo modem
Changes: =

 t38modem (1.2.0-1build1) maverick; urgency=3Dlow
 .
   * No changes upload for 

Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-15 Thread Michael S. Tsirkin
On Wed, Sep 15, 2010 at 10:00:04AM -0700, Shirley Ma wrote:
 On Wed, 2010-09-15 at 17:39 +0200, Michael S. Tsirkin wrote:
  In fact, I rechecked: both bridge and loopback have NETIF_F_HIGHDMA
  set.
  So maybe we should check NETIF_F_NETNS_LOCAL ...
  
  macvtap in bridged mode is interesting as well. 
 
 I found that too, just wondered which flag to use is better. :)
 
 Thanks
 Shirley

At some level NETIF_F_NETNS_LOCAL makes sense: local packets
can get anywhere. OTOH one wonders whether there might be other
issues, e.g. in theory devices could hang on to frag pages
just by doing get_page.  There might be other issues.
Maybe we are better off white-listing known-good drivers
with a new flag?

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-15 Thread Shirley Ma
On Wed, 2010-09-15 at 19:30 +0200, Michael S. Tsirkin wrote:
 At some level NETIF_F_NETNS_LOCAL makes sense: local packets
 can get anywhere. OTOH one wonders whether there might be other
 issues, e.g. in theory devices could hang on to frag pages
 just by doing get_page.  There might be other issues.
 Maybe we are better off white-listing known-good drivers
 with a new flag? 

Sounds reasonable to me if there is no such a flag to be easily used.

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:
  +base = (unsigned long)from-iov_base + offset1;
  +size = ((base  ~PAGE_MASK) + len + ~PAGE_MASK)
 PAGE_SHIFT;
  +num_pages = get_user_pages_fast(base, size,
 0,page[i]);
  +if ((num_pages != size) ||
  +(num_pages  MAX_SKB_FRAGS -
 skb_shinfo(skb)-nr_frags))
  +/* put_page is in skb free */
  +return -EFAULT;
  What keeps the user from writing to these pages in it's address
 space
  after the write call returns?
 
  A write() return of success means:
 
I wrote what you gave to me
 
  not
 
I wrote what you gave to me, oh and BTW don't touch these
pages for a while.
 
  In fact a while isn't even defined in any way, as there is no way
  for the write() invoker to know when the networking card is done
 with
  those pages.
 
 That's what io_submit() is for.  Then io_getevents() tells you what
 a 
 while actually was.

This macvtap zero copy uses iov buffers from vhost ring, which is
allocated from guest kernel. In host kernel, vhost calls macvtap
sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
pages for zero copy.

The patch is relying on how vhost handle these buffers. I need to look
at vhost code (qemu) first for addressing the questions here.

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Arnd Bergmann
On Tuesday 14 September 2010, Shirley Ma wrote:
 On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:

  That's what io_submit() is for.  Then io_getevents() tells you what
  a 
  while actually was.
 
 This macvtap zero copy uses iov buffers from vhost ring, which is
 allocated from guest kernel. In host kernel, vhost calls macvtap
 sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
 pages for zero copy.
 
 The patch is relying on how vhost handle these buffers. I need to look
 at vhost code (qemu) first for addressing the questions here.

I guess the best solution would be to make macvtap_aio_write return
-EIOCBQUEUED when a packet gets passed down to the adapter, and
call aio_complete when the adapter is done with it.

This would change the regular behavior of macvtap into a model where
every write on the file blocks until the packet has left the machine,
which gives us better flow control, but does slow down the traffic
when we only put one packet at a time into the queue.

It also allows the user to call io_submit instead of write in order
to do an asynchronous submission as Avi was suggesting.

Arnd
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote:
 I would expect this to hurt performance significantly.
 We could do this for asynchronous requests only to avoid the
 slowdown. 

Is kiocb in sendmsg helpful here? It is not used now.

Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote:
 On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote:
  I would expect this to hurt performance significantly.
  We could do this for asynchronous requests only to avoid the
  slowdown. 
 
 Is kiocb in sendmsg helpful here? It is not used now.
 
 Shirley

Precisely. This is what the patch from Xin Xiaohui does.  That code
already seems to do most of what you are trying to do, right?

The main thing missing seems to be macvtap integration, so that we can fall back
on data copy if zero copy is unavailable?
How hard would it be to basically link the mp and macvtap modules
together to get us this functionality? Anyone?


-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 18:29 +0200, Michael S. Tsirkin wrote:
 Precisely. This is what the patch from Xin Xiaohui does.  That code
 already seems to do most of what you are trying to do, right?

I thought host pins guest kernel buffer pages was good enough for TX
thought I didn't look up xiaohui's vhost asycn io patch in details.

What's the performance data Xiaohui got from using kiocb? I haven't seen
any performance number from him yet.

 The main thing missing seems to be macvtap integration, so that we can
 fall back
 on data copy if zero copy is unavailable?
 How hard would it be to basically link the mp and macvtap modules
 together to get us this functionality? Anyone? 

The simple integration is using macvtap + xiaohui's vhost asycn io
patch. I can make a try for TX only.

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 10:02:25AM -0700, Shirley Ma wrote:
 On Tue, 2010-09-14 at 18:29 +0200, Michael S. Tsirkin wrote:
  Precisely. This is what the patch from Xin Xiaohui does.  That code
  already seems to do most of what you are trying to do, right?
 
 I thought host pins guest kernel buffer pages was good enough for TX
 thought I didn't look up xiaohui's vhost asycn io patch in details.

As others said, the harder issues for TX are in determining that it's safe
to unpin the memory, and how much memory is it safe to pin to beging
with.  For RX we have some more complexity.

 What's the performance data Xiaohui got from using kiocb? I haven't seen
 any performance number from him yet.
 
  The main thing missing seems to be macvtap integration, so that we can
  fall back
  on data copy if zero copy is unavailable?
  How hard would it be to basically link the mp and macvtap modules
  together to get us this functionality? Anyone? 
 
 The simple integration is using macvtap + xiaohui's vhost asycn io
 patch. I can make a try for TX only.
 
 Thanks
 Shirley

Well it's up to you of course, but this is not what I would try:
if you look at the
patchset vhost changes is not the largest part of it,
so this sounds a bit like effort duplication.

TX only is also much less interesting than full zero copy.

I think that you should be able to simply combine
the two drivers together, add an ioctl to
enable/disable zero copy mode of operation.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote:
 As others said, the harder issues for TX are in determining that it's
 safe
 to unpin the memory, and how much memory is it safe to pin to beging
 with.  For RX we have some more complexity.

I think unpin the memory is in kfree_skb() whenever the last reference
is gone for TX. What we discussed about here is when/how vhost get
notified to update ring buffer descriptors. Do I misunderstand something
here? 

 Well it's up to you of course, but this is not what I would try:
 if you look at the
 patchset vhost changes is not the largest part of it,
 so this sounds a bit like effort duplication.
 
 TX only is also much less interesting than full zero copy.

It's not true. From the performance, TX only has gained big improvement.
We need to identify how much gain from TX zero copy, and how much gain
from RX zero copy.

 I think that you should be able to simply combine
 the two drivers together, add an ioctl to
 enable/disable zero copy mode of operation. 

That could work. But what's the purpose to have two drivers if one
driver can handle it?

Thanks
Shirley


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 11:49:03AM -0700, Shirley Ma wrote:
 On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote:
  As others said, the harder issues for TX are in determining that it's
  safe
  to unpin the memory, and how much memory is it safe to pin to beging
  with.  For RX we have some more complexity.
 
 I think unpin the memory is in kfree_skb() whenever the last reference
 is gone for TX. What we discussed about here is when/how vhost get
 notified to update ring buffer descriptors. Do I misunderstand something
 here? 

Right, that's a better way to put it.

  Well it's up to you of course, but this is not what I would try:
  if you look at the
  patchset vhost changes is not the largest part of it,
  so this sounds a bit like effort duplication.
  
  TX only is also much less interesting than full zero copy.
 
 It's not true. From the performance, TX only has gained big improvement.
 We need to identify how much gain from TX zero copy, and how much gain
 from RX zero copy.

I was speaking from the code point of view:
since we'll want both TX and RX eventually it's nice to
see that some thought was given to RX even if we only merge
TX as a first step.

From the product POV, RX is already slower (more interrupts, etc)
than TX so speeding it up might be more important,
but I agree, every bit helps.

  I think that you should be able to simply combine
  the two drivers together, add an ioctl to
  enable/disable zero copy mode of operation. 
 
 That could work. But what's the purpose to have two drivers if one
 driver can handle it?
 
 Thanks
 Shirley

This was just an idea: I thought it's a good way for people interested
in this zero copy thing to combine forces and avoid making
the same mistakes, but it's not a must of course.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote:
 On Tue, Sep 14, 2010 at 11:49:03AM -0700, Shirley Ma wrote:
  On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote:
   As others said, the harder issues for TX are in determining that
 it's
   safe
   to unpin the memory, and how much memory is it safe to pin to
 beging
   with.  For RX we have some more complexity.
  
  I think unpin the memory is in kfree_skb() whenever the last
 reference
  is gone for TX. What we discussed about here is when/how vhost get
  notified to update ring buffer descriptors. Do I misunderstand
 something
  here? 
 
 Right, that's a better way to put it. 

That's how this macvtap patch did. For how much pinned pages,it is
limited by sk_wmem_alloc size in this patch.

thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote:
   I think that you should be able to simply combine
   the two drivers together, add an ioctl to
   enable/disable zero copy mode of operation. 
  
  That could work. But what's the purpose to have two drivers if one
  driver can handle it?
  
  Thanks
  Shirley
 
 This was just an idea: I thought it's a good way for people interested
 in this zero copy thing to combine forces and avoid making
 the same mistakes, but it's not a must of course. 

Ok, I will make a simple patch by reusing Xiaohui's some vhost code on
handling vhost_add_used_and_signal() to see any performance changes.

The interesting thing here when I run 32 instances netperf/netserver I
didn't see any issue w/i this patch.

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
From: Shirley Ma [mailto:mashi...@us.ibm.com]
Sent: Tuesday, September 14, 2010 11:05 PM
To: Avi Kivity
Cc: David Miller; a...@arndb.de; m...@redhat.com; Xin, Xiaohui; 
net...@vger.kernel.org;
kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
kernel

On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:
  +base = (unsigned long)from-iov_base + offset1;
  +size = ((base  ~PAGE_MASK) + len + ~PAGE_MASK)
 PAGE_SHIFT;
  +num_pages = get_user_pages_fast(base, size,
 0,page[i]);
  +if ((num_pages != size) ||
  +(num_pages  MAX_SKB_FRAGS -
 skb_shinfo(skb)-nr_frags))
  +/* put_page is in skb free */
  +return -EFAULT;
  What keeps the user from writing to these pages in it's address
 space
  after the write call returns?
 
  A write() return of success means:
 
I wrote what you gave to me
 
  not
 
I wrote what you gave to me, oh and BTW don't touch these
pages for a while.
 
  In fact a while isn't even defined in any way, as there is no way
  for the write() invoker to know when the networking card is done
 with
  those pages.

 That's what io_submit() is for.  Then io_getevents() tells you what
 a
 while actually was.

This macvtap zero copy uses iov buffers from vhost ring, which is
allocated from guest kernel. In host kernel, vhost calls macvtap
sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
pages for zero copy.

The patch is relying on how vhost handle these buffers. I need to look
at vhost code (qemu) first for addressing the questions here.

Thanks
Shirley

I think what David said is what we have thought before in mp device.
Since we are not sure the exact time the tx buffer was wrote though DMA 
operation.
But the deadline is when the tx buffer was freed. So we only notify the vhost 
stuff
about the write when tx buffer freed. But the deadline is maybe too late for 
performance.

Thanks
Xiaohui 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
From: Arnd Bergmann [mailto:a...@arndb.de]
Sent: Tuesday, September 14, 2010 11:21 PM
To: Shirley Ma
Cc: Avi Kivity; David Miller; m...@redhat.com; Xin, Xiaohui; 
net...@vger.kernel.org;
kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
kernel

On Tuesday 14 September 2010, Shirley Ma wrote:
 On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:

  That's what io_submit() is for.  Then io_getevents() tells you what
  a
  while actually was.

 This macvtap zero copy uses iov buffers from vhost ring, which is
 allocated from guest kernel. In host kernel, vhost calls macvtap
 sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
 pages for zero copy.

 The patch is relying on how vhost handle these buffers. I need to look
 at vhost code (qemu) first for addressing the questions here.

I guess the best solution would be to make macvtap_aio_write return
-EIOCBQUEUED when a packet gets passed down to the adapter, and
call aio_complete when the adapter is done with it.

This would change the regular behavior of macvtap into a model where
every write on the file blocks until the packet has left the machine,
which gives us better flow control, but does slow down the traffic
when we only put one packet at a time into the queue.

It also allows the user to call io_submit instead of write in order
to do an asynchronous submission as Avi was suggesting.


But currently, this patch is communicated with vhost-net, which is almost
in the kernel side. If it uses aio stuff, it should be communicate with user
space Backend. 
 
   Arnd
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Wed, 2010-09-15 at 09:50 +0800, Xin, Xiaohui wrote:
 I think what David said is what we have thought before in mp device.
 Since we are not sure the exact time the tx buffer was wrote though
 DMA operation.
 But the deadline is when the tx buffer was freed. So we only notify
 the vhost stuff
 about the write when tx buffer freed. But the deadline is maybe too
 late for performance.

Have you tried it? If so what's the performance penalty you have seen by
notifying vhost when tx buffer freed?

I am thinking to have a callback in skb destructor,
vhost_add_used_and_signal gets updated when skb is actually freed, vhost
vq  head need to be passed to the callback. This might requires vhost
ring size is at least as big as the lower device driver. 

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Wednesday, September 15, 2010 12:30 AM
To: Shirley Ma
Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; 
net...@vger.kernel.org;
kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
kernel

On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote:
 On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote:
  I would expect this to hurt performance significantly.
  We could do this for asynchronous requests only to avoid the
  slowdown.

 Is kiocb in sendmsg helpful here? It is not used now.

 Shirley

Precisely. This is what the patch from Xin Xiaohui does.  That code
already seems to do most of what you are trying to do, right?

The main thing missing seems to be macvtap integration, so that we can fall 
back
on data copy if zero copy is unavailable?
How hard would it be to basically link the mp and macvtap modules
together to get us this functionality? Anyone?

Michael,
Is to support macvtap with zero-copy through mp device the functionality
you mentioned above?
Before Shirley Ma has suggested to move the zero-copy functionality into
tun/tap device or macvtap device. How do you think about that? I suspect
there will be a lot of duplicate code in that three drivers except we can 
extract
code of zero-copy into kernel APIs and vhost APIs.
Do you think that's worth to do and help current process which is blocked too
long than I expected?


--
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
From: Shirley Ma [mailto:mashi...@us.ibm.com]
Sent: Wednesday, September 15, 2010 10:41 AM
To: Xin, Xiaohui
Cc: Avi Kivity; David Miller; a...@arndb.de; m...@redhat.com; 
net...@vger.kernel.org;
kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
kernel

On Wed, 2010-09-15 at 09:50 +0800, Xin, Xiaohui wrote:
 I think what David said is what we have thought before in mp device.
 Since we are not sure the exact time the tx buffer was wrote though
 DMA operation.
 But the deadline is when the tx buffer was freed. So we only notify
 the vhost stuff
 about the write when tx buffer freed. But the deadline is maybe too
 late for performance.

Have you tried it? If so what's the performance penalty you have seen by
notifying vhost when tx buffer freed?


We did not try it before, as we cared RX side more.

I am thinking to have a callback in skb destructor,
vhost_add_used_and_signal gets updated when skb is actually freed, vhost
vq  head need to be passed to the callback. This might requires vhost
ring size is at least as big as the lower device driver.


That's almost the same what we have done except we use destructor_arg and
another callback..

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 12:36:23PM -0700, Shirley Ma wrote:
 On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote:
I think that you should be able to simply combine
the two drivers together, add an ioctl to
enable/disable zero copy mode of operation. 
   
   That could work. But what's the purpose to have two drivers if one
   driver can handle it?
   
   Thanks
   Shirley
  
  This was just an idea: I thought it's a good way for people interested
  in this zero copy thing to combine forces and avoid making
  the same mistakes, but it's not a must of course. 
 
 Ok, I will make a simple patch by reusing Xiaohui's some vhost code on
 handling vhost_add_used_and_signal() to see any performance changes.
 
 The interesting thing here when I run 32 instances netperf/netserver I
 didn't see any issue w/i this patch.
 
 Thanks
 Shirley

Yes, I agree this patch is useful for demo purposes:
simple, and shows what kind of performance gains
we can expect for TX.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 07:40:52PM -0700, Shirley Ma wrote:
 On Wed, 2010-09-15 at 09:50 +0800, Xin, Xiaohui wrote:
  I think what David said is what we have thought before in mp device.
  Since we are not sure the exact time the tx buffer was wrote though
  DMA operation.
  But the deadline is when the tx buffer was freed. So we only notify
  the vhost stuff
  about the write when tx buffer freed. But the deadline is maybe too
  late for performance.
 
 Have you tried it? If so what's the performance penalty you have seen by
 notifying vhost when tx buffer freed?
 
 I am thinking to have a callback in skb destructor,
 vhost_add_used_and_signal gets updated when skb is actually freed, vhost
 vq  head need to be passed to the callback. This might requires vhost
 ring size is at least as big as the lower device driver. 
 
 Thanks
 Shirley

For some of the issues, try following the discussion around
net: af_packet: don't call tpacket_destruct_skb() until the skb is sent
out.

Summary: it's difficult to do correctly generally. Limiting ourselves
to transmit on specific devices might make it possible.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 12:20:29PM -0700, Shirley Ma wrote:
 On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote:
  On Tue, Sep 14, 2010 at 11:49:03AM -0700, Shirley Ma wrote:
   On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote:
As others said, the harder issues for TX are in determining that
  it's
safe
to unpin the memory, and how much memory is it safe to pin to
  beging
with.  For RX we have some more complexity.
   
   I think unpin the memory is in kfree_skb() whenever the last
  reference
   is gone for TX. What we discussed about here is when/how vhost get
   notified to update ring buffer descriptors. Do I misunderstand
  something
   here? 
  
  Right, that's a better way to put it. 
 
 That's how this macvtap patch did. For how much pinned pages,it is
 limited by sk_wmem_alloc size in this patch.
 
 thanks
 Shirley

Except that you seem to pin full pages but account sub-page size
in wmem.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-13 Thread Shirley Ma
Add zero copy feature between userspace and kernel in macvtap when lower device 
supports
high memory DMA.


Signed-off-by: Shirley Ma x...@us.ibm.com
---

 drivers/net/macvtap.c |  136 +
 1 files changed, 126 insertions(+), 10 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 3b1c54a..186cde1 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -274,6 +274,7 @@ static int macvtap_open(struct inode *inode, struct file 
*file)
struct net *net = current-nsproxy-net_ns;
struct net_device *dev = dev_get_by_index(net, iminor(inode));
struct macvtap_queue *q;
+   struct macvlan_dev *vlan = netdev_priv(dev);
int err;
 
err = -ENODEV;
@@ -302,6 +303,17 @@ static int macvtap_open(struct inode *inode, struct file 
*file)
q-flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
q-vnet_hdr_sz = sizeof(struct virtio_net_hdr);
 
+   /*
+* so far only VM uses macvtap, enable zero copy between guest
+* kernel and host kernel when lower device supports high memory
+* DMA
+*/
+   if (vlan) {
+   if ((vlan-lowerdev-features  NETIF_F_HIGHDMA) 
+   (vlan-lowerdev-features  NETIF_F_SG))
+   sock_set_flag(q-sk, SOCK_ZEROCOPY);
+   }
+
err = macvtap_set_queue(dev, file, q);
if (err)
sock_put(q-sk);
@@ -343,6 +355,24 @@ out:
return mask;
 }
 
+#define GOODCOPY_LEN  (L1_CACHE_BYTES  64 ? 64 : L1_CACHE_BYTES)
+
+static inline struct sk_buff *macvtap_alloc_skb_goodcopy(struct sock *sk,
+size_t prepad, size_t copy,
+int noblock, int *err)
+{
+   struct sk_buff *skb;
+
+   skb = sock_alloc_send_pskb(sk, prepad + copy, 0, noblock, err);
+   if (!skb)
+   return NULL;
+   skb_reserve(skb, prepad);
+   skb_put(skb, copy);
+
+   return skb;
+
+}
+
 static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
size_t len, size_t linear,
int noblock, int *err)
@@ -447,15 +477,91 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff 
*skb,
return 0;
 }
 
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int set_sg_from_iovec_zerocopy(struct sk_buff *skb,
+ const struct iovec *from, int offset,
+ size_t count)
+{
+   int len = iov_length(from, count) - offset;
+   int copy = skb_headlen(skb);
+   int size, offset1 = 0;
+   int i = 0;
+   skb_frag_t *f;
+
+   /* Skip over from offset */
+   while (offset = from-iov_len) {
+   offset -= from-iov_len;
+   ++from;
+   --count;
+   }
+
+   /* copy up to skb headlen */
+   while (copy  0) {
+   size = min_t(unsigned int, copy, from-iov_len - offset);
+   if (copy_from_user(skb-data + offset1, from-iov_base + offset,
+  size))
+   return -EFAULT;
+   if (copy  size) {
+   ++from;
+   --count;
+   }
+   copy -= size;
+   offset1 += size;
+   offset = 0;
+   }
+
+   if (len == offset1)
+   return 0;
+
+   while (count--) {
+   struct page *page[MAX_SKB_FRAGS];
+   int num_pages;
+   unsigned long base;
+
+   len = from-iov_len - offset1;
+   if (!len) {
+   offset1 = 0;
+   ++from;
+   continue;
+   }
+   base = (unsigned long)from-iov_base + offset1;
+   size = ((base  ~PAGE_MASK) + len + ~PAGE_MASK)  PAGE_SHIFT;
+   num_pages = get_user_pages_fast(base, size, 0, page[i]);
+   if ((num_pages != size) ||
+   (num_pages  MAX_SKB_FRAGS - skb_shinfo(skb)-nr_frags))
+   /* put_page is in skb free */
+   return -EFAULT;
+   while (len) {
+   f = skb_shinfo(skb)-frags[i];
+   f-page = page[i];
+   f-page_offset = base  ~PAGE_MASK;
+   f-size = min_t(int, len, PAGE_SIZE - f-page_offset);
+   skb-data_len += f-size;
+   skb-len += f-size;
+   skb-truesize += f-size;
+   skb_shinfo(skb)-nr_frags++;
+   /* increase sk_wmem_alloc */
+   if (skb-sk  skb-destructor == sock_wfree)
+   atomic_add(f-size, skb-sk-sk_wmem_alloc);
+   

Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-13 Thread David Miller
From: Shirley Ma mashi...@us.ibm.com
Date: Mon, 13 Sep 2010 13:48:03 -0700

 + base = (unsigned long)from-iov_base + offset1;
 + size = ((base  ~PAGE_MASK) + len + ~PAGE_MASK)  PAGE_SHIFT;
 + num_pages = get_user_pages_fast(base, size, 0, page[i]);
 + if ((num_pages != size) ||
 + (num_pages  MAX_SKB_FRAGS - skb_shinfo(skb)-nr_frags))
 + /* put_page is in skb free */
 + return -EFAULT;

What keeps the user from writing to these pages in it's address space
after the write call returns?

A write() return of success means:

I wrote what you gave to me

not

I wrote what you gave to me, oh and BTW don't touch these
 pages for a while.

In fact a while isn't even defined in any way, as there is no way
for the write() invoker to know when the networking card is done with
those pages.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html