Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-07-03 Thread Herbert Xu
On Mon, Jun 28, 2010 at 05:56:07PM +0800, Xin, Xiaohui wrote:

 OK, if I understand you correctly then I don't think we have a
 problem.  With your current patch-set you have exactly the same
 situation when skb->data is reallocated as a kernel buffer.
 
 When will skb->data be reallocated? Could you point me to the code path?

Anything that calls pskb_expand_head.

 This is OK because as you correctly argue, it is a rare situation.
 
 With my proposal you will need to get this extra external buffer
 in even fewer cases, because you'd only need to do it if the skb
 head grows, which only happens if it becomes encapsulated.
 So let me explain it in a bit more detail:
 
 Our packet starts out as a purely non-linear skb, i.e., skb->head
 contains nothing and all the page frags come from the guest.
 
 During host processing we may pull data into skb->head but the
 first frag will remain unless we pull all of it.  If we did do
 that then you would have a free external buffer anyway.
 
 Now in the common case the header may be modified or pulled, but
 it very rarely grows.  So you can just copy the header back into
 the first frag just before we give it to the guest.
 
 Since the data is still there, recomputing the page offset and size is OK, 
 right?

Right, you just move the page offset back and increase the size.
However, to do this safely we'd need to have a way of knowing
whether the skb head has been modified.

It may well turn out to be just as effective to do something like

if (memcmp(skb->data, page frag head, skb_headlen))
        memcpy(page frag head, skb->data, skb_headlen);

As the page frag head should be in cache, since it would've been
used to populate skb->data.

It'd be good to run some benchmarks with this to see whether
adding a bit to sk_buff to avoid the memcmp is worth it or not.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-28 Thread Xin, Xiaohui
-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Sunday, June 27, 2010 2:15 PM
To: Dong, Eddie
Cc: Xin, Xiaohui; Stephen Hemminger; net...@vger.kernel.org; 
kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:

 In the current patch, each SKB for the assigned device (SRIOV VF or NIC or a
 complete queue pair) uses a buffer from the guest, so it eliminates the copy
 completely in software and requires hardware to do so. If we can have an
 additional place to store the buffer per skb (which may cause a copy later
 on), we can do the copy later on or re-post the buffer to the assigned NIC
 driver later on. But that may not be very clean either :(

OK, if I understand you correctly then I don't think we have a
problem.  With your current patch-set you have exactly the same
situation when skb->data is reallocated as a kernel buffer.


When will skb->data be reallocated? Could you point me to the code path?

This is OK because as you correctly argue, it is a rare situation.

With my proposal you will need to get this extra external buffer
in even fewer cases, because you'd only need to do it if the skb
head grows, which only happens if it becomes encapsulated.
So let me explain it in a bit more detail:

Our packet starts out as a purely non-linear skb, i.e., skb->head
contains nothing and all the page frags come from the guest.

During host processing we may pull data into skb->head but the
first frag will remain unless we pull all of it.  If we did do
that then you would have a free external buffer anyway.

Now in the common case the header may be modified or pulled, but
it very rarely grows.  So you can just copy the header back into
the first frag just before we give it to the guest.

Since the data is still there, recomputing the page offset and size is OK, 
right?

Only in the case where the packet header grows (e.g., encapsulation)
would you need to get an extra external buffer.

Cheers,
--
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-28 Thread Michael S. Tsirkin
On Mon, Jun 28, 2010 at 05:56:07PM +0800, Xin, Xiaohui wrote:
 -Original Message-
 From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
 Sent: Sunday, June 27, 2010 2:15 PM
 To: Dong, Eddie
 Cc: Xin, Xiaohui; Stephen Hemminger; net...@vger.kernel.org; 
 kvm@vger.kernel.org;
 linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
 da...@davemloft.net;
 jd...@linux.intel.com
 Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
 external.
 
 On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
 
  In the current patch, each SKB for the assigned device (SRIOV VF or NIC or a
  complete queue pair) uses a buffer from the guest, so it eliminates the copy
  completely in software and requires hardware to do so. If we can have an
  additional place to store the buffer per skb (which may cause a copy later
  on), we can do the copy later on or re-post the buffer to the assigned NIC
  driver later on. But that may not be very clean either :(
 
 OK, if I understand you correctly then I don't think we have a
 problem.  With your current patch-set you have exactly the same
 situation when skb->data is reallocated as a kernel buffer.
 
 
 When will skb->data be reallocated? Could you point me to the code path?
 
 This is OK because as you correctly argue, it is a rare situation.
 
 With my proposal you will need to get this extra external buffer
 in even fewer cases, because you'd only need to do it if the skb
 head grows, which only happens if it becomes encapsulated.
 So let me explain it in a bit more detail:
 
 Our packet starts out as a purely non-linear skb, i.e., skb->head
 contains nothing and all the page frags come from the guest.
 
 During host processing we may pull data into skb->head but the
 first frag will remain unless we pull all of it.  If we did do
 that then you would have a free external buffer anyway.
 
 Now in the common case the header may be modified or pulled, but
 it very rarely grows.  So you can just copy the header back into
 the first frag just before we give it to the guest.
 
 Since the data is still there, recomputing the page offset and size is OK, 
 right?

Question: can devices use parts of the same page
in frags of different skbs (or for other purposes)? If they do,
we'll corrupt that memory if we try to stick the header there.

We have another option: reserve some buffers
posted by the guest and use them if we need to copy
the header. This seems the most straightforward to me.

 Only in the case where the packet header grows (e.g., encapsulation)
 would you need to get an extra external buffer.
 
 Cheers,
 --
 Email: Herbert Xu herb...@gondor.apana.org.au
 Home Page: http://gondor.apana.org.au/~herbert/
 PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-27 Thread Herbert Xu
On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
 
 In the current patch, each SKB for the assigned device (SRIOV VF or NIC or a 
 complete queue pair) uses a buffer from the guest, so it eliminates the copy 
 completely in software and requires hardware to do so. If we can have an 
 additional place to store the buffer per skb (which may cause a copy later 
 on), we can do the copy later on or re-post the buffer to the assigned NIC 
 driver later on. But that may not be very clean either :(

OK, if I understand you correctly then I don't think we have a
problem.  With your current patch-set you have exactly the same
situation when skb->data is reallocated as a kernel buffer.

This is OK because as you correctly argue, it is a rare situation.

With my proposal you will need to get this extra external buffer
in even fewer cases, because you'd only need to do it if the skb
head grows, which only happens if it becomes encapsulated.

So let me explain it in a bit more detail:

Our packet starts out as a purely non-linear skb, i.e., skb->head
contains nothing and all the page frags come from the guest.

During host processing we may pull data into skb->head but the
first frag will remain unless we pull all of it.  If we did do
that then you would have a free external buffer anyway.

Now in the common case the header may be modified or pulled, but
it very rarely grows.  So you can just copy the header back into
the first frag just before we give it to the guest.

Only in the case where the packet header grows (e.g., encapsulation)
would you need to get an extra external buffer.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-25 Thread Michael S. Tsirkin
On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
 Herbert Xu wrote:
  On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
  
   I mean once the frontend side driver posts the buffers to the backend
   driver, the backend driver will immediately use those buffers to
   compose skbs or gro_frags and post them to the assigned host NIC
   driver as receive buffers. In that case, if the backend driver
   receives a packet from the NIC that requires a copy, it may be
   unable to find an additional free guest buffer because all of them are
   already used by the NIC driver. We have to reserve some guest
   buffers for the possible copy even if the buffer address is not
   identified by the original skb :(
  
   OK I see what you mean.  Can you tell me how Xiaohui's
   previous patch-set deals with this problem?
  
  Thanks,
 
 In the current patch, each SKB for the assigned device (SRIOV VF or NIC or a 
 complete queue pair) uses a buffer from the guest, so it eliminates the copy 
 completely in software and requires hardware to do so. If we can have an 
 additional place to store the buffer per skb (which may cause a copy later 
 on), we can do the copy later on or re-post the buffer to the assigned NIC 
 driver later on. But that may not be very clean either :(
 BTW, some hardware may require a certain level of packet copying, such as for 
 broadcast packets in very old VMDq devices, which is not addressed in 
 Xiaohui's previous patch yet. We may address this by implementing an 
 additional virtqueue between guest and host for the slow path (broadcast 
 packets only here), with additional complexity in the FE/BE drivers. 
 
 Thx, Eddie

The guest posts a large number of buffers to the host.
The host can use them any way it wants to, and in any order,
for example reserving half the buffers for the copy.

This might waste some memory if buffers are used
only partially, but let's worry about this later.

-- 
MST


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-24 Thread Herbert Xu
On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
 
 I mean once the frontend side driver posts the buffers to the backend driver, 
 the backend driver will immediately use those buffers to compose skbs or 
 gro_frags and post them to the assigned host NIC driver as receive buffers. 
 In that case, if the backend driver receives a packet from the NIC that 
 requires a copy, it may be unable to find an additional free guest buffer 
 because all of them are already used by the NIC driver. We have to reserve 
 some guest buffers for the possible copy even if the buffer address is not 
 identified by the original skb :(

OK I see what you mean.  Can you tell me how Xiaohui's
previous patch-set deals with this problem?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-24 Thread Dong, Eddie
Herbert Xu wrote:
 On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
 
 I mean once the frontend side driver posts the buffers to the backend
 driver, the backend driver will immediately use those buffers to
 compose skbs or gro_frags and post them to the assigned host NIC
 driver as receive buffers. In that case, if the backend driver
 receives a packet from the NIC that requires a copy, it may be
 unable to find an additional free guest buffer because all of them are
 already used by the NIC driver. We have to reserve some guest
 buffers for the possible copy even if the buffer address is not
 identified by the original skb :(
 
 OK I see what you mean.  Can you tell me how Xiaohui's
 previous patch-set deals with this problem?
 
 Thanks,

In the current patch, each SKB for the assigned device (SRIOV VF or NIC or a 
complete queue pair) uses a buffer from the guest, so it eliminates the copy 
completely in software and requires hardware to do so. If we can have an 
additional place to store the buffer per skb (which may cause a copy later 
on), we can do the copy later on or re-post the buffer to the assigned NIC 
driver later on. But that may not be very clean either :(

BTW, some hardware may require a certain level of packet copying, such as for 
broadcast packets in very old VMDq devices, which is not addressed in 
Xiaohui's previous patch yet. We may address this by implementing an 
additional virtqueue between guest and host for the slow path (broadcast 
packets only here), with additional complexity in the FE/BE drivers. 

Thx, Eddie


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-24 Thread Xin, Xiaohui
Herbert,
That's why I have sent you the patch for the guest virtio-net driver. I 
reserved 512 bytes in each page, so I can always have the space to copy into 
and avoid the issue of the backend running out of buffers.

Thanks
Xiaohui

-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Thursday, June 24, 2010 6:09 PM
To: Dong, Eddie
Cc: Xin, Xiaohui; Stephen Hemminger; net...@vger.kernel.org; 
kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:

 I mean once the frontend side driver posts the buffers to the backend driver, 
 the backend driver will immediately use those buffers to compose skbs or 
 gro_frags and post them to the assigned host NIC driver as receive buffers. 
 In that case, if the backend driver receives a packet from the NIC that 
 requires a copy, it may be unable to find an additional free guest buffer 
 because all of them are already used by the NIC driver. We have to reserve 
 some guest buffers for the possible copy even if the buffer address is not 
 identified by the original skb :(

OK I see what you mean.  Can you tell me how Xiaohui's
previous patch-set deals with this problem?

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-23 Thread Dong, Eddie

 3) As I have mentioned above, with this idea, netdev_alloc_skb() will
 allocate as usual, and the data pointed to by skb->data will be copied into
 the first guest buffer. That means we should reserve sufficient room in the
 guest buffer. For a PS-mode-capable driver (for example ixgbe), the room
 will be more than 128 bytes. After 128 bytes, we will put the first frag
 data. Look into virtio-net.c at the functions page_to_skb() and
 receive_mergeable(); that means we should modify the guest virtio-net
 driver to compute the offset as the parameter for skb_set_frag().
 
 What do you think about this? Attached is a patch showing how to modify the
 guest driver. I reserve 512 bytes as an example, and transfer the header
 length of the skb in hdr->hdr_len.
 
Xiaohui & Herbert:
Mixing copying of the head & 0-copy of bulk data imposes an additional 
challenge in finding the guest buffer. The backend driver may be unable to 
find a spare guest buffer in the virtqueue at that time, which may then block 
the receiving process.
Can't we completely eliminate netdev_alloc_skb here? Assigning the guest 
buffer at this time makes life much easier.
Thx, Eddie


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-23 Thread Dong, Eddie
Herbert Xu wrote:
 On Wed, Jun 23, 2010 at 04:09:40PM +0800, Dong, Eddie wrote:
 
 Xiaohui & Herbert:
  Mixing copying of the head & 0-copy of bulk data imposes an additional
 challenge in finding the guest buffer. The backend driver may be
 unable to find a spare guest buffer in the virtqueue at that time,
 which may then block the receiving process. Can't we completely
 eliminate netdev_alloc_skb here? Assigning the guest buffer at this
 time makes life much easier.
 
 I'm not sure I understand your concern.  If you mean that when
 the guest doesn't give enough pages to the host and the host
 can't receive on behalf of the guest then isn't that already
 the case with the original patch-set?
 

I mean once the frontend side driver posts the buffers to the backend driver, 
the backend driver will immediately use those buffers to compose skbs or 
gro_frags and post them to the assigned host NIC driver as receive buffers. In 
that case, if the backend driver receives a packet from the NIC that requires 
a copy, it may be unable to find an additional free guest buffer because all 
of them are already used by the NIC driver. We have to reserve some guest 
buffers for the possible copy even if the buffer address is not identified by 
the original skb :(

Thx, Eddie


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-23 Thread Herbert Xu
On Wed, Jun 23, 2010 at 04:09:40PM +0800, Dong, Eddie wrote:

 Xiaohui & Herbert:
   Mixing copying of the head & 0-copy of bulk data imposes an additional 
 challenge in finding the guest buffer. The backend driver may be unable to 
 find a spare guest buffer in the virtqueue at that time, which may then 
 block the receiving process.
   Can't we completely eliminate netdev_alloc_skb here? Assigning the guest 
 buffer at this time makes life much easier.

I'm not sure I understand your concern.  If you mean that when
the guest doesn't give enough pages to the host and the host
can't receive on behalf of the guest then isn't that already
the case with the original patch-set?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Herbert Xu
On Sun, Jun 20, 2010 at 01:06:32PM +0300, Michael S. Tsirkin wrote:

 Changing the guest virtio to match the backend is a problem,
 this breaks migration etc.

As long as it's done in a backwards compatible way it should be
fine.  It's just like migrating from a backend that supports TSO
to one that doesn't.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Michael S. Tsirkin
On Sun, Jun 20, 2010 at 08:32:35PM +1000, Herbert Xu wrote:
 On Sun, Jun 20, 2010 at 01:06:32PM +0300, Michael S. Tsirkin wrote:
 
  Changing the guest virtio to match the backend is a problem,
  this breaks migration etc.
 
 As long as it's done in a backwards compatible way it should be
 fine.

Possibly, but to me the need to do this implies that
we'll need another change with different hardware at the backend.

 It's just like migrating from a backend that supports TSO
 to one that doesn't.
 
 Cheers,

Exactly. We don't support such migration.

 -- 
 Visit Openswan at http://www.openswan.org/
 Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
 Home Page: http://gondor.apana.org.au/~herbert/
 PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Herbert Xu
On Sun, Jun 20, 2010 at 01:39:09PM +0300, Michael S. Tsirkin wrote:

  It's just like migrating from a backend that supports TSO
  to one that doesn't.
 
 Exactly. We don't support such migration.

Well, that's something that has to be addressed in virtio_net.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Michael S. Tsirkin
On Sun, Jun 20, 2010 at 09:02:54PM +1000, Herbert Xu wrote:
 On Sun, Jun 20, 2010 at 01:39:09PM +0300, Michael S. Tsirkin wrote:
 
   It's just like migrating from a backend that supports TSO
   to one that doesn't.
  
  Exactly. We don't support such migration.
 
 Well that's something that has to be addressed in the virtio_net.

Rather than modifying all guests, it seems much easier not to assume
specific buffer layout in host.  Copying network header around seems a
small cost.

 Cheers,
 -- 
 Visit Openswan at http://www.openswan.org/
 Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
 Home Page: http://gondor.apana.org.au/~herbert/
 PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Herbert Xu
On Sun, Jun 20, 2010 at 02:11:24PM +0300, Michael S. Tsirkin wrote:

 Rather than modifying all guests, it seems much easier not to assume
 specific buffer layout in host.  Copying network header around seems a
 small cost.

Well sure we can debate the specifics of this implementation detail.

However, the fact that virtio_net doesn't support feature renegotiation
on live migration is not a valid reason against this.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Herbert Xu
On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:

 Let's do this then.  So far the virtio spec avoided making layout
 assumptions, leaving guests lay out data as they see fit.
 Isn't it possible to keep supporting this with zero copy for hardware
 that can issue DMA at arbitrary addresses?

I think you're mistaken with respect to what is being proposed.
Reserving 512 bytes isn't a hard constraint; it is merely an
optimisation for Intel NICs, because their PS mode can produce
a head fragment of up to 512 bytes.

If the guest didn't allocate 512 bytes it wouldn't be the end of
the world, it'd just mean that we'd either copy whatever is in
the head fragment, or we waste 4096-X bytes of memory where X
is the number of bytes in the head.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Michael S. Tsirkin
On Sun, Jun 20, 2010 at 09:59:26PM +1000, Herbert Xu wrote:
 On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:
 
  Let's do this then.  So far the virtio spec avoided making layout
  assumptions, leaving guests lay out data as they see fit.
  Isn't it possible to keep supporting this with zero copy for hardware
  that can issue DMA at arbitrary addresses?
 
 I think you're mistaken with respect to what is being proposed.
 Reserving 512 bytes isn't a hard constraint; it is merely an
 optimisation for Intel NICs, because their PS mode can produce
 a head fragment of up to 512 bytes.
 
Thanks for the clarification. So what is discussed here is
the API changes that will enable this optimization?
Of course, it makes sense to consider this to try and avoid code churn
in the future.

As a side note, I hope to see a basic zero copy implementation with
GSO/GRO that beats copy in host convincingly before work is started on
further optimizations, though.

 If the guest didn't allocate 512 bytes it wouldn't be the end of
 the world, it'd just mean that we'd either copy whatever is in
 the head fragment,
I don't know how much copying the head will cost.

 or we waste 4096-X bytes of memory where X
 is the number of bytes in the head.

This seems mostly harmless - and guest can always do a copy internally
to save memory, correct?
Note also that we lock a full page to allow DMA, anyway.

 Cheers,
 -- 
 Visit Openswan at http://www.openswan.org/
 Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
 Home Page: http://gondor.apana.org.au/~herbert/
 PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Ben Hutchings
On Sun, 2010-06-20 at 21:59 +1000, Herbert Xu wrote:
 On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:
 
  Let's do this then.  So far the virtio spec avoided making layout
  assumptions, leaving guests lay out data as they see fit.
  Isn't it possible to keep supporting this with zero copy for hardware
  that can issue DMA at arbitrary addresses?
 
 I think you're mistaken with respect to what is being proposed.
 Reserving 512 bytes isn't a hard constraint; it is merely an
 optimisation for Intel NICs, because their PS mode can produce
 a head fragment of up to 512 bytes.
 
 If the guest didn't allocate 512 bytes it wouldn't be the end of
 the world, it'd just mean that we'd either copy whatever is in
 the head fragment, or we waste 4096-X bytes of memory where X
 is the number of bytes in the head.

If I understand correctly what this 'PS mode' is (I haven't seen the
documentation for it), it is a feature that Microsoft requested from
hardware vendors for use in Hyper-V.  As a result, the SFC9000 family
and presumably other controllers also implement something similar.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-18 Thread Herbert Xu
On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote:

 Herbert,
 I have questions about the idea above:
 1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(), 
 then we don't need napi_gro_frags() any more, the driver's original receiving 
 function is ok. Right?

Well I was actually thinking about converting all drivers that
need this to napi_gro_frags.  But now that you mention it, yes
we could still keep the old interface to minimise the work.

 2) Is napi_gro_frags() only suitable for TCP protocol packets? 
 I have done a small test with the ixgbe driver to let it only allocate paged 
 buffers and found the kernel hangs when napi_gro_frags() receives an ARP 
 packet.

It should work with any packet.  In fact, I'm pretty sure the
other drivers (e.g., cxgb3) use that interface for all packets.

 3) As I have mentioned above, with this idea, netdev_alloc_skb() will 
 allocate as usual, and the data pointed to by skb->data will be copied into 
 the first guest buffer. That means we should reserve sufficient room in the 
 guest buffer. For a PS-mode-capable driver (for example ixgbe), the room 
 will be more than 128 bytes. After 128 bytes, we will put the first frag 
 data. Look into virtio-net.c at the functions page_to_skb() and 
 receive_mergeable(); that means we should modify the guest virtio-net 
 driver to compute the offset as the parameter for skb_set_frag().
 
 How do you think about this? Attached is a patch to how to modify the guest 
 driver.
 I reserve 512 bytes as an example, and transfer the header len of the skb in 
 hdr-hdr_len.

Expanding the buffer size to 512 bytes to accommodate PS mode
looks reasonable to me.

However, I don't think we should increase the copy threshold to
512 bytes at the same time.  I don't have any figures myself but
I think if we are to make such a change it should be a separate
one and come with supporting numbers.
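
To make the two constants in this exchange concrete, here is a tiny userspace
sketch (the 512 and 128 figures come from the thread; the function names are
purely illustrative, not anything in the kernel): the guest reserves 512 bytes
of room so a PS-mode header fits, while the host-side copy-break threshold,
below which a packet is simply copied rather than handed over in pages, can
stay at 128 independently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GUEST_HEADROOM 512   /* room reserved in the first guest buffer */
#define COPY_THRESHOLD 128   /* copy-break threshold, deliberately unchanged */

/* A packet shorter than the threshold is simply copied in full;
 * larger packets keep their payload in pages. */
static bool copy_whole_packet(size_t pkt_len)
{
    return pkt_len < COPY_THRESHOLD;
}

/* The copied-in header must fit in the room the guest reserved. */
static bool header_fits(size_t hdr_len)
{
    return hdr_len <= GUEST_HEADROOM;
}
```

The point of keeping the two numbers separate is exactly the one above:
enlarging the reserved room is a layout decision, while raising the copy
threshold is a performance decision that would need its own numbers.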

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-18 Thread Xin, Xiaohui
-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Friday, June 18, 2010 1:59 PM
To: Xin, Xiaohui
Cc: Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com; Rusty Russell
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote:

 Herbert,
 I have questions about the idea above:
 1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(),
 then we don't need napi_gro_frags() any more; the driver's original receive
 function is OK. Right?

Well I was actually thinking about converting all drivers that
need this to napi_gro_frags.  But now that you mention it, yes
we could still keep the old interface to minimise the work.

 2) Is napi_gro_frags() only suitable for TCP packets?
 I have done a small test with the ixgbe driver, letting it allocate only 
 paged buffers, and found the kernel hangs when napi_gro_frags() receives an 
 ARP packet.

It should work with any packet.  In fact, I'm pretty sure the
other drivers (e.g., cxgb3) use that interface for all packets.

Thanks for the verification. By the way, does that mean that nearly all 
drivers can use the same napi_gro_frags() to receive buffers, though currently 
each driver has its own receive function?

 3) As I have mentioned above, with this idea netdev_alloc_skb() will allocate 
 as usual, and the data pointed to by skb->data will be copied into the first 
 guest buffer. That means we should reserve sufficient room in the guest 
 buffer. For a PS-mode-capable driver (for example ixgbe), the room will be 
 more than 128 bytes; after 128 bytes we will put the first frag data. Looking 
 at page_to_skb() and receive_mergeable() in virtio-net.c, that means we 
 should modify the guest virtio-net driver to compute the offset as the 
 parameter for skb_set_frag().
 
 What do you think about this? Attached is a patch showing how to modify the 
 guest driver. I reserve 512 bytes as an example, and transfer the header 
 length of the skb in hdr->hdr_len.

Expanding the buffer size to 512 bytes to accommodate PS mode
looks reasonable to me.

However, I don't think we should increase the copy threshold to
512 bytes at the same time.  I don't have any figures myself but
I think if we are to make such a change it should be a separate
one and come with supporting numbers.

Let me have a look to see if I can retain the copy threshold at 128 bytes 
and copy the header data safely.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-18 Thread Herbert Xu
On Fri, Jun 18, 2010 at 03:14:18PM +0800, Xin, Xiaohui wrote:

 Thanks for the verification. By the way, does that mean that nearly all 
 drivers can use the same napi_gro_frags() to receive buffers, though 
 currently each driver has its own receive function?

There is no reason why the napi_gro_frags can't be used by any
driver that supports receiving data into pages.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-17 Thread Herbert Xu
On Sat, Jun 12, 2010 at 05:31:10PM +0800, Xin, Xiaohui wrote:
 
 1) Modify the driver from netdev_alloc_skb() to alloc user pages if the dev 
 is zero-copied.
   If the driver supports PS mode, then modify alloc_page() too.

Well if you were doing this then the driver won't be generating
skbs at all.  So you don't need to change netdev_alloc_skb.

The driver would currently do alloc_page, so you would replace
that with netdev_alloc_page, which can call your new function
to allocate an external page where appropriate.

IOW you just need one change in the driver if it already uses
the skbless interface: replacing alloc_page.

If the driver doesn't use the skbless interface then you need
to make a few more changes, but it isn't too hard either; it'll
also mean that the driver will have less overhead even for normal
use, which is a win-win situation.
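
The "one change" being described can be sketched in userspace terms as swapping
the allocator the RX refill path uses (all names below are hypothetical
stand-ins; in the kernel the hook would be netdev_alloc_page deciding between
alloc_page and a guest-supplied page):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical page handle; in the kernel this role is played by
 * struct page coming either from alloc_page() or from the guest. */
struct page_buf { bool from_guest; };

static struct page_buf *alloc_kernel_page(void)
{
    struct page_buf *p = malloc(sizeof(*p));
    assert(p != NULL);
    p->from_guest = false;
    return p;
}

static struct page_buf *alloc_guest_page(void)
{
    struct page_buf *p = malloc(sizeof(*p));
    assert(p != NULL);
    p->from_guest = true;
    return p;
}

/* The driver refills its RX ring through one allocator hook; switching
 * to zero-copy just means passing a different hook, with no other
 * changes to the refill path. */
static struct page_buf *rx_refill(struct page_buf *(*alloc)(void))
{
    return alloc();
}
```

This is only a model of the control flow, not the actual driver interface.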

 2) Add napi_gro_frags() in the driver to receive the user pages instead of 
 the driver's receive function.
 
 3) napi_gro_frags() will allocate a small skb and pull the header data from 
 the first page into skb->data.
 
 Is the above the way you have suggested?

Yes.

 I have thought about the way in some detail.
 
 1) The first page will have an offset after the header is copied into the 
 allocated kernel skb. The offset should be recalculated when the user page 
 data is transferred to the guest. This may modify some of the gro code.

We could keep track whether the stack has modified the header,
since you can simply ignore it if it doesn't modify it, which
should be the common case for virt.
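
The offset bookkeeping in point 1 can be sketched in userspace (the struct
below is an illustrative stand-in, not the real skb_frag_t): pulling the header
out of the first frag advances its page offset, and if the host never modified
the header, returning the page to the guest is just the inverse operation of
moving the offset back and growing the size.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for the first page frag of a non-linear skb. */
struct frag {
    size_t page_offset;  /* where the valid data starts in the page */
    size_t size;         /* how much valid data the frag holds */
};

/* Host pulls hdr_len bytes of header out of the frag into skb->data:
 * the frag now starts after the header. */
static void pull_header(struct frag *f, size_t hdr_len)
{
    assert(hdr_len <= f->size);
    f->page_offset += hdr_len;
    f->size -= hdr_len;
}

/* If the stack did not modify the header, returning the page to the
 * guest only requires undoing the pull: move the offset back and
 * increase the size again. */
static void restore_header(struct frag *f, size_t hdr_len)
{
    assert(hdr_len <= f->page_offset);
    f->page_offset -= hdr_len;
    f->size += hdr_len;
}
```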

 2) napi_gro_frags() may remove a page when its data is totally pulled, but we 
 cannot put a user page the normal way. This may modify the gro code too.

If it does anything like that, then we're not in the fast-path
case so you can just fall back to copying.

 3) When the user buffers are returned to the guest, some of them need to have 
 a vnet header added. That means for some pages, vnet header room should be 
 reserved at allocation time. But we cannot know which one will be used as the 
 first page when allocating. If we reserved a vnet header for each page, then 
 since set_skb_frag() in the guest driver only uses offset 0 for the second 
 and later pages, the page data would be wrong.

I don't see why this would be a problem, since as far as what
the driver is putting onto the physical RX ring nothing has
changed.  IOW if you want to designate a certain page as special,
or the first page, you can still do so.

So can you explain which bits of your patches would be affected
by this?

 4) Since the user buffer pages should be released, we still need a dtor 
 callback to do that, and then I still need a place to hold it. What do you 
 think about putting it in skb_shinfo?

While I don't like that very much I guess I can live with that
if nobody else objects.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-17 Thread Herbert Xu
On Sun, Jun 13, 2010 at 04:58:36PM +0800, Xin, Xiaohui wrote:
 
 Herbert,
 In this way, I think we should create at least three functions in drivers: to 
 allocate the rx buffers, to receive the rx buffers, and to clean the rx 
 buffers.
 
 We can also go another way here. We can provide a function to only substitute 
 alloc_page(), and a function to release the pages when cleaning the rx 
 buffers.

Yes that's exactly what I had in mind.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-13 Thread Xin, Xiaohui
-Original Message-
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On 
Behalf Of
Xin, Xiaohui
Sent: Saturday, June 12, 2010 5:31 PM
To: Herbert Xu
Cc: Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Friday, June 11, 2010 1:21 PM
To: Xin, Xiaohui
Cc: Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:

 I'm not sure if I understand your way correctly:
 1) Does the way only deal with drivers with the SG feature? Since the packet
 is non-linear...

No, the hardware doesn't have to support SG.  You just need to
place the entire packet contents in a page instead of skb->head.

 2) Is skb->data still pointing to guest user buffers?
 If yes, how do we avoid modifications to the net core's handling of the skb?

skb->data would not point to guest user buffers.  In the common
case the packet is not modified on its way to the guest so this
is not an issue.

In the rare case where it is modified, you only have to copy the
bits which are modified and the cost of that is inconsequential
since you have to write to that memory anyway.

 3) In our way only parts of the drivers need to be modified to support 
 zero-copy; and here, do we need to modify all the drivers?

If you're asking the portion of each driver supporting zero-copy
that needs to be modified, then AFAICS this doesn't change that
very much at all.

 I think making skb->head empty at first will take more effort to pass the 
 checks on the skb header. Have I missed something here? I actually make 
 skb->head NULL just before kfree(skb) in skb_release_data(); it's done by 
 the callback we have made for the skb.

No, I'm not suggesting you set it to NULL.  It should have some
memory allocated, but skb_headlen(skb) should be zero.

Please have a look at how the napi_gro_frags interface works (e.g.,
in drivers/net/cxgb3/sge.c).  This is exactly the model that I am
suggesting.
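
As a rough userspace model of that interface (the field names mirror, but are
not, the real struct sk_buff): in a napi_gro_frags-style skb the head has
memory allocated, yet all packet data sits in page frags, so skb_headlen() is
zero until the stack pulls some data into the head.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of skb length bookkeeping: len counts all data, data_len
 * counts only the paged (frag) portion, and the linear "head" length
 * is the difference, mirroring skb_headlen(). */
struct toy_skb {
    size_t len;       /* total data in the skb */
    size_t data_len;  /* data held in page frags */
};

static size_t toy_headlen(const struct toy_skb *skb)
{
    return skb->len - skb->data_len;
}

/* Build a purely non-linear skb: every byte is in frags. */
static struct toy_skb napi_gro_frags_style(size_t pkt_len)
{
    struct toy_skb skb = { pkt_len, pkt_len };
    return skb;
}

/* Pulling bytes into the head moves them out of the paged portion. */
static void toy_pull(struct toy_skb *skb, size_t n)
{
    assert(n <= skb->data_len);
    skb->data_len -= n;
}
```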

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Herbert,
I have set out here what I think you have in mind; please clarify if I have 
missed something.

1) Modify the driver from netdev_alloc_skb() to alloc user pages if the dev is 
zero-copied.
  If the driver supports PS mode, then modify alloc_page() too.
2) Add napi_gro_frags() in the driver to receive the user pages instead of the 
driver's receive function.
3) napi_gro_frags() will allocate a small skb and pull the header data from
the first page into skb->data.

Is the above the way you have suggested?
I have thought about the way in some detail.

1) The first page will have an offset after the header is copied into the 
allocated kernel skb. The offset should be recalculated when the user page 
data is transferred to the guest. This may modify some of the gro code.

2) napi_gro_frags() may remove a page when its data is totally pulled, but we 
cannot put a user page the normal way. This may modify the gro code too.

3) When the user buffers are returned to the guest, some of them need to have 
a vnet header added. That means for some pages, vnet header room should be 
reserved at allocation time. But we cannot know which one will be used as the 
first page when allocating. If we reserved a vnet header for each page, then 
since set_skb_frag() in the guest driver only uses offset 0 for the second and 
later pages, the page data would be wrong.

4) Since the user buffer pages should be released, we still need a dtor 
callback to do that, and then I still need a place to hold it. What do you 
think about putting it in skb_shinfo?

Currently this is all I can think of.
What do you think?

Thanks
Xiaohui

Herbert,
In this way, I think we should create at least three functions in drivers: to 
allocate the rx buffers, to receive the rx buffers, and to clean the rx 
buffers.

We can also go another way here. We can provide a function to only substitute 
alloc_page(), and a function to release the pages when cleaning the rx 
buffers.
The skb for the rx buffer can be allocated in the original way, and when 
pushing the data to the guest, the header data will be copied into the guest 
buffer. In this way, we should reserve sufficient room for the header in the 
first guest user buffer. That needs modifications to the guest virtio-net 
kernel, and this way is only suitable for PS-mode-capable drivers. Considering 
that advanced drivers mostly have PS mode, it should not be a critical issue.

Thanks
Xiaohui

RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-12 Thread Xin, Xiaohui
-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Friday, June 11, 2010 1:21 PM
To: Xin, Xiaohui
Cc: Stephen Hemminger; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:

 I'm not sure if I understand your way correctly:
 1) Does the way only deal with drivers with the SG feature? Since the packet
 is non-linear...

No, the hardware doesn't have to support SG.  You just need to
place the entire packet contents in a page instead of skb->head.

 2) Is skb->data still pointing to guest user buffers?
 If yes, how do we avoid modifications to the net core's handling of the skb?

skb->data would not point to guest user buffers.  In the common
case the packet is not modified on its way to the guest so this
is not an issue.

In the rare case where it is modified, you only have to copy the
bits which are modified and the cost of that is inconsequential
since you have to write to that memory anyway.

 3) In our way only parts of the drivers need to be modified to support 
 zero-copy; and here, do we need to modify all the drivers?

If you're asking the portion of each driver supporting zero-copy
that needs to be modified, then AFAICS this doesn't change that
very much at all.

 I think making skb->head empty at first will take more effort to pass the 
 checks on the skb header. Have I missed something here? I actually make 
 skb->head NULL just before kfree(skb) in skb_release_data(); it's done by 
 the callback we have made for the skb.

No, I'm not suggesting you set it to NULL.  It should have some
memory allocated, but skb_headlen(skb) should be zero.

Please have a look at how the napi_gro_frags interface works (e.g.,
in drivers/net/cxgb3/sge.c).  This is exactly the model that I am
suggesting.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Herbert,
I have set out here what I think you have in mind; please clarify if I have 
missed something.

1) Modify the driver from netdev_alloc_skb() to alloc user pages if the dev is 
zero-copied.
  If the driver supports PS mode, then modify alloc_page() too.
2) Add napi_gro_frags() in the driver to receive the user pages instead of the 
driver's receive function.
3) napi_gro_frags() will allocate a small skb and pull the header data from 
the first page into skb->data.

Is the above the way you have suggested?
I have thought about the way in some detail.

1) The first page will have an offset after the header is copied into the 
allocated kernel skb. The offset should be recalculated when the user page 
data is transferred to the guest. This may modify some of the gro code.

2) napi_gro_frags() may remove a page when its data is totally pulled, but we 
cannot put a user page the normal way. This may modify the gro code too.

3) When the user buffers are returned to the guest, some of them need to have 
a vnet header added. That means for some pages, vnet header room should be 
reserved at allocation time. But we cannot know which one will be used as the 
first page when allocating. If we reserved a vnet header for each page, then 
since set_skb_frag() in the guest driver only uses offset 0 for the second and 
later pages, the page data would be wrong.

4) Since the user buffer pages should be released, we still need a dtor 
callback to do that, and then I still need a place to hold it. What do you 
think about putting it in skb_shinfo?

Currently this is all I can think of.
What do you think?

Thanks
Xiaohui 





RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-09 Thread Xin, Xiaohui
-Original Message-
From: Stephen Hemminger [mailto:shemmin...@vyatta.com]
Sent: Monday, June 07, 2010 7:14 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
herb...@gondor.apana.org.au;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

Still not sure this is a good idea for a couple of reasons:

1. We already have lots of special cases with skb's (frags and fraglist),
   and skb's travel through a lot of different parts of the kernel.  So any
   new change like this creates lots of exposed points for new bugs. Look
   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
   and ppp and ...


Yes, I agree with your concern to some extent. But the skbs which use external 
pages in our case have a short path: they start in the NIC driver and are then 
forwarded to the guest immediately. They will not travel into the host kernel 
stack or any filters, which avoids many of the problems you have mentioned 
here.

Another point is that we try to make the solution generic across different NIC 
drivers, so that many drivers may use the way without modification.
  
2. SKB's can have infinite lifetime in the kernel. If these buffers come from
   a fixed size pool in an external device, they can easily all get tied up
   if you have a slow listener. What happens then?

The pool is not fixed-size; its size is the number of usable buffers submitted 
by the guest virtio-net driver. Guest virtio-net will check how many buffers 
are filled and resubmit. What does a slow listener mean? A slow NIC driver?

Thanks
Xiaohui


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-09 Thread Xin, Xiaohui
-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf 
Of Andi
Kleen
Sent: Monday, June 07, 2010 3:51 PM
To: Stephen Hemminger
Cc: Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

Stephen Hemminger shemmin...@vyatta.com writes:

 Still not sure this is a good idea for a couple of reasons:

 1. We already have lots of special cases with skb's (frags and fraglist),
and skb's travel through a lot of different parts of the kernel.  So any
new change like this creates lots of exposed points for new bugs. Look
at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
and ppp and ...

 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
a fixed size pool in an external device, they can easily all get tied up
if you have a slow listener. What happens then?

3. If they come from an internal pool what happens when the kernel runs
low on memory? How is that pool balanced against other kernel
memory users?

The size of the pool is limited by the virtqueue capacity now.
If the virtqueue is really huge, then how to balance the memory is a problem.
I have not yet worked out how to tune it dynamically...

-Andi

--
a...@linux.intel.com -- Speaking for myself only.


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-09 Thread Xin, Xiaohui
-Original Message-
From: Mitchell Erblich [mailto:erbli...@earthlink.net]
Sent: Monday, June 07, 2010 4:17 PM
To: Andi Kleen
Cc: Stephen Hemminger; Xin, Xiaohui; net...@vger.kernel.org; 
kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
herb...@gondor.apana.org.au; jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.


On Jun 7, 2010, at 12:51 AM, Andi Kleen wrote:

 Stephen Hemminger shemmin...@vyatta.com writes:

 Still not sure this is a good idea for a couple of reasons:

 1. We already have lots of special cases with skb's (frags and fraglist),
   and skb's travel through a lot of different parts of the kernel.  So any
   new change like this creates lots of exposed points for new bugs. Look
   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
   and ppp and ...

 2. SKB's can have infinite lifetime in the kernel. If these buffers come 
 from
   a fixed size pool in an external device, they can easily all get tied up
   if you have a slow listener. What happens then?

 3. If they come from an internal pool what happens when the kernel runs
 low on memory? How is that pool balanced against other kernel
 memory users?

 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only.

In general,

When an internal pool is created/used, there SHOULD be a reason.
Maybe, to keep allocation latency to a min, OR ...

The internal pool here is a collection of user buffers submitted
by the guest virtio-net driver. The guest puts buffers here and the driver
gets buffers from it. If the guest submits more buffers than the driver needs,
we need somewhere to put them; that is what the internal pool here
is for.

Now IMO,

internal pool objects should have a ref count and
if that count is 0, then under memory pressure and/or num
of objects are above a high water mark, then they are freed,

OR if there is a last reference age field, then the object is to be
cleaned if dirty, then freed,

Else, the pool is allowed to grow if the number of objects in the
pool is below a set max (max COULD equal Infinity).

Thanks for the thoughts.

Basically, the size of the internal pool is not decided by the pool itself,
and adding/deleting the objects in the pool is not a task of the pool itself 
either. It's decided by both the guest virtio-net driver and the vhost-net 
driver, and by the guest receive speed and submit speed.
The max size of the pool is limited by the virtqueue buffer numbers.

Thanks
Xiaohui




Mitchell Erblich


RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-09 Thread Xin, Xiaohui
-Original Message-
From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
Sent: Tuesday, June 08, 2010 1:28 PM
To: Stephen Hemminger
Cc: Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org;
linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu; 
da...@davemloft.net;
jd...@linux.intel.com
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from 
external.

On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote:
 Still not sure this is a good idea for a couple of reasons:

 1. We already have lots of special cases with skb's (frags and fraglist),
and skb's travel through a lot of different parts of the kernel.  So any
new change like this creates lots of exposed points for new bugs. Look
at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
and ppp and ...

 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
a fixed size pool in an external device, they can easily all get tied up
if you have a slow listener. What happens then?

I agree with Stephen on this.

FWIW I don't think we even need the external pages concept in
order to implement zero-copy receive (which I gather is the intent
here).

Here is one way to do it, simply construct a completely non-linear
packet in the driver, as you would if you were using the GRO frags
interface (grep for napi_gro_frags under drivers/net for examples).

I'm not sure if I understand your way correctly:
1) Does the way only deal with drivers with the SG feature? Since the packet 
is non-linear...

2) Is skb->data still pointing to guest user buffers?
If yes, how do we avoid modifications to the net core's handling of the skb?

3) In our way only parts of the drivers need to be modified to support 
zero-copy; and here, do we need to modify all the drivers?

If I missed your idea, could you explain it in more detail?

This way you can transfer the entire contents of the packet without
copying through to the other side, provided that the host stack does
not modify the packet.


If the host side did modify the packet then we have to incur the
memory cost anyway.

IOW I think the only feature provided by the external pages
construct is allowing the skb->head area to be shared without
copying.  I'm claiming that this can be done by simply making
skb->head empty.

I think making skb->head empty at first will take more effort to pass the 
checks on the skb header. Have I missed something here? I actually make 
skb->head NULL just before kfree(skb) in skb_release_data(); it's done by the 
callback we have made for the skb.

Thanks
Xiaohui

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-07 Thread Andi Kleen
Stephen Hemminger shemmin...@vyatta.com writes:

 Still not sure this is a good idea for a couple of reasons:

 1. We already have lots of special cases with skb's (frags and fraglist),
and skb's travel through a lot of different parts of the kernel.  So any
new change like this creates lots of exposed points for new bugs. Look
at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
and ppp and ...

 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
a fixed size pool in an external device, they can easily all get tied up
if you have a slow listener. What happens then?

3. If they come from an internal pool what happens when the kernel runs 
low on memory? How is that pool balanced against other kernel
memory users?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-07 Thread Mitchell Erblich

On Jun 7, 2010, at 12:51 AM, Andi Kleen wrote:

 Stephen Hemminger shemmin...@vyatta.com writes:
 
 Still not sure this is a good idea for a couple of reasons:
 
 1. We already have lots of special cases with skb's (frags and fraglist),
   and skb's travel through a lot of different parts of the kernel.  So any
   new change like this creates lots of exposed points for new bugs. Look
   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
   and ppp and ...
 
 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
   a fixed size pool in an external device, they can easily all get tied up
   if you have a slow listener. What happens then?
 
 3. If they come from an internal pool what happens when the kernel runs 
 low on memory? How is that pool balanced against other kernel
 memory users?
 
 -Andi
 
 -- 
 a...@linux.intel.com -- Speaking for myself only.

In general,

When an internal pool is created/used, there SHOULD be a reason.
Maybe, to keep allocation latency to a min, OR ...

Now IMO, 

internal pool objects should have a ref count and
if that count is 0, then under memory pressure and/or num 
of objects are above a high water mark, then they are freed,

OR if there is a last reference age field, then the object is to be 
cleaned if dirty, then freed,  

Else, the pool is allowed to grow if the number of objects in the
pool is below a set max (max COULD equal Infinity).



Mitchell Erblich


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-07 Thread Herbert Xu
On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote:
 Still not sure this is a good idea for a couple of reasons:
 
 1. We already have lots of special cases with skb's (frags and fraglist),
and skb's travel through a lot of different parts of the kernel.  So any
new change like this creates lots of exposed points for new bugs. Look
at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
and ppp and ...
 
 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
a fixed size pool in an external device, they can easily all get tied up
if you have a slow listener. What happens then?

I agree with Stephen on this.

FWIW I don't think we even need the external pages concept in
order to implement zero-copy receive (which I gather is the intent
here).

Here is one way to do it, simply construct a completely non-linear
packet in the driver, as you would if you were using the GRO frags
interface (grep for napi_gro_frags under drivers/net for examples).
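
As a concrete illustration of the napi_gro_frags pattern mentioned above, here is a kernel-space sketch (it only builds inside a driver; rx_page, rx_offset and rx_len are placeholders for driver-specific receive state):

```c
/* Sketch of a headerless, fully non-linear receive path.
 * The skb comes from the NAPI layer with an empty skb->head;
 * the externally owned page is attached as a frag, and GRO/the
 * stack pull whatever headers they need themselves. */
static void demo_rx_frag(struct napi_struct *napi,
			 struct page *rx_page,
			 unsigned int rx_offset,
			 unsigned int rx_len)
{
	struct sk_buff *skb = napi_get_frags(napi);

	if (!skb)
		return;			/* out of memory: drop */

	/* Attach the external page; skb->head stays empty. */
	skb_fill_page_desc(skb, 0, rx_page, rx_offset, rx_len);
	skb->len += rx_len;
	skb->data_len += rx_len;
	skb->truesize += rx_len;

	napi_gro_frags(napi);
}
```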

This way you can transfer the entire contents of the packet through
to the other side without copying, provided that the host stack does
not modify the packet.

If the host side did modify the packet then we have to incur the
memory cost anyway.

IOW I think the only feature provided by the external pages
construct is allowing the skb->head area to be shared without
copying.  I'm claiming that this can be done by simply making
skb->head empty.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-06 Thread Stephen Hemminger
Still not sure this is a good idea for a couple of reasons:

1. We already have lots of special cases with skb's (frags and fraglist),
   and skb's travel through a lot of different parts of the kernel.  So any
   new change like this creates lots of exposed points for new bugs. Look
   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
   and ppp and ...

2. SKB's can have infinite lifetime in the kernel. If these buffers come from
   a fixed size pool in an external device, they can easily all get tied up
   if you have a slow listener. What happens then?


[RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-05 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..cf309c9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,18 @@ struct skb_shared_info {
void *  destructor_arg;
 };
 
+/* The structure is for an skb whose skb->data may point to
+ * an external buffer, which is not allocated from kernel space.
+ * Since the buffer is external, the shinfo and frags are
+ * external too.  It also contains a destructor for itself.
+ */
+struct skb_external_page {
+   u8  *start;
+   int size;
+   struct skb_frag_struct *frags;
+   struct skb_shared_info *ushinfo;
+   void(*dtor)(struct skb_external_page *);
+};
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
-- 
1.5.4.4
