Re: [PATCH v5 REPOST 0/6] fix hw_random stuck

2014-12-22 Thread Herbert Xu
On Mon, Dec 08, 2014 at 04:50:34PM +0800, Amos Kong wrote:
 When I hotunplug a busy virtio-rng device or try to access
 hwrng attributes in non-smp guest, it gets stuck.
 
 My hotplug tests:
 
 | test 0:
 |   hotunplug rng device from qemu monitor
 |
 | test 1:
 |   guest) # dd if=/dev/hwrng of=/dev/null 
 |   hotunplug rng device from qemu monitor
 |
 | test 2:
 |   guest) # dd if=/dev/random of=/dev/null 
 |   hotunplug rng device from qemu monitor
 |
 | test 4:
 |   guest) # dd if=/dev/hwrng of=/dev/null 
 |   cat /sys/devices/virtual/misc/hw_random/rng_*
 |
 | test 5:
 |   guest) # dd if=/dev/hwrng of=/dev/null
 |   cancel dd process after 10 seconds
 |   guest) # dd if=/dev/hwrng of=/dev/null 
 |   hotunplug rng device from qemu monitor
 |
 | test 6:
 |   use a fifo as rng backend, execute test 0 ~ 5 with no input of fifo

All applied.  Thanks a lot!
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/6] fix hw_random stuck

2014-12-05 Thread Herbert Xu
On Sat, Dec 06, 2014 at 12:16:36PM +0800, Amos Kong wrote:
 When I hotunplug a busy virtio-rng device or try to access
 hwrng attributes in non-smp guest, it gets stuck.

Please resend these via the linux-crypto mailing list so they
can be picked up by patchwork.

Thanks,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] hw_random: fix unregister race.

2014-11-11 Thread Herbert Xu
On Mon, Nov 10, 2014 at 09:47:27PM +0800, Herbert Xu wrote:
 On Mon, Nov 03, 2014 at 11:56:24PM +0800, Amos Kong wrote:
 
  @@ -98,6 +99,8 @@ static inline void cleanup_rng(struct kref *kref)
   
  if (rng-cleanup)
  rng-cleanup(rng);
 
 You need a compiler barrier here to prevent reordering.

Michael Büsch pointed out that we should actually have a memory
barrier here.  I thought we didn't need it because I was only
worried about the code in my original complaint.  However, expecting
driver writers to use correct synchronisation primitives is surely
asking too much.

So let's add an smp_wmb() here to ensure all side-effects of
cleanup is visible.

Thanks,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] hw_random: fix unregister race.

2014-11-10 Thread Herbert Xu
On Mon, Nov 03, 2014 at 11:56:24PM +0800, Amos Kong wrote:

 @@ -98,6 +99,8 @@ static inline void cleanup_rng(struct kref *kref)
  
   if (rng-cleanup)
   rng-cleanup(rng);

You need a compiler barrier here to prevent reordering.

 + rng-cleanup_done = true;

Thanks,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/6] hw_random: fix unregister race.

2014-11-02 Thread Herbert Xu
On Sun, Nov 02, 2014 at 11:06:13PM +0800, Amos Kong wrote:
 On Fri, Oct 31, 2014 at 03:23:21PM +0800, Herbert Xu wrote:
  On Fri, Oct 31, 2014 at 10:28:00AM +1030, Rusty Russell wrote:
   Herbert Xu herb...@gondor.apana.org.au writes:
On Thu, Sep 18, 2014 at 08:37:45PM +0800, Amos Kong wrote:
From: Rusty Russell ru...@rustcorp.com.au

The previous patch added one potential problem: we can still be
reading from a hwrng when it's unregistered.  Add a wait for zero
in the hwrng_unregister path.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
   
You totally corrupted Rusty's patch.  If you're going to repost
his series you better make sure that you've got the right patches.
   
Just as well though as it made me think a little more about this
patch :)
   
   OK Amos, can you please repost the complete series?
  
  Please fix the unregister race I identified first.
 
 Ok, I redownload the patches from 
 https://patchwork.kernel.org/project/LKML/list/?submitter=5
 and rebase fixes of me and rusty on it. I will post a V4 later. Thanks.

Please fix the unregister race I pointed out before doing a v4.

Thanks,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/6] hw_random: fix unregister race.

2014-10-31 Thread Herbert Xu
On Fri, Oct 31, 2014 at 10:28:00AM +1030, Rusty Russell wrote:
 Herbert Xu herb...@gondor.apana.org.au writes:
  On Thu, Sep 18, 2014 at 08:37:45PM +0800, Amos Kong wrote:
  From: Rusty Russell ru...@rustcorp.com.au
  
  The previous patch added one potential problem: we can still be
  reading from a hwrng when it's unregistered.  Add a wait for zero
  in the hwrng_unregister path.
  
  Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 
  You totally corrupted Rusty's patch.  If you're going to repost
  his series you better make sure that you've got the right patches.
 
  Just as well though as it made me think a little more about this
  patch :)
 
 OK Amos, can you please repost the complete series?

Please fix the unregister race I identified first.

Thanks,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] hw_random: fix unregister race.

2014-10-21 Thread Herbert Xu
On Thu, Sep 18, 2014 at 12:18:24PM +0930, Rusty Russell wrote:
 The previous patch added one potential problem: we can still be
 reading from a hwrng when it's unregistered.  Add a wait for zero
 in the hwrng_unregister path.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 ---
  drivers/char/hw_random/core.c | 5 +
  1 file changed, 5 insertions(+)
 
 diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
 index dc9092a1075d..b4a21e9521cf 100644
 --- a/drivers/char/hw_random/core.c
 +++ b/drivers/char/hw_random/core.c
 @@ -60,6 +60,7 @@ static DEFINE_MUTEX(rng_mutex);
  static DEFINE_MUTEX(reading_mutex);
  static int data_avail;
  static u8 *rng_buffer, *rng_fillbuf;
 +static DECLARE_WAIT_QUEUE_HEAD(rng_done);
  static unsigned short current_quality;
  static unsigned short default_quality; /* = 0; default to off */
  
 @@ -98,6 +99,7 @@ static inline void cleanup_rng(struct kref *kref)
  
   if (rng-cleanup)
   rng-cleanup(rng);
 + wake_up_all(rng_done);
  }
  
  static void set_current_rng(struct hwrng *rng)
 @@ -529,6 +531,9 @@ void hwrng_unregister(struct hwrng *rng)
   }
  
   mutex_unlock(rng_mutex);
 +
 + /* Just in case rng is reading right now, wait. */
 + wait_event(rng_done, atomic_read(rng-ref.refcount) == 0);

While it's obviously better than what we have now, I don't believe
this is 100% safe as the cleanup function might still be running
even after the ref count hits zero.  Once we return from this function
the module may be unloaded so we need to ensure that nothing is
running at this point.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/6] hw_random: fix unregister race.

2014-10-21 Thread Herbert Xu
On Thu, Sep 18, 2014 at 08:37:45PM +0800, Amos Kong wrote:
 From: Rusty Russell ru...@rustcorp.com.au
 
 The previous patch added one potential problem: we can still be
 reading from a hwrng when it's unregistered.  Add a wait for zero
 in the hwrng_unregister path.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au

You totally corrupted Rusty's patch.  If you're going to repost
his series you better make sure that you've got the right patches.

Just as well though as it made me think a little more about this
patch :)

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/3] fix stuck in accessing hwrng attributes

2014-09-17 Thread Herbert Xu
On Tue, Sep 16, 2014 at 12:02:26AM +0800, Amos Kong wrote:
 If we read hwrng by long-running dd process, it takes too much cpu
 time and almost hold the mutex lock. When we check hwrng attributes
 from sysfs by cat, it gets stuck in waiting the lock releaseing.
 The problem can only be reproduced with non-smp guest with slow backend.
 
 This patchset resolves the issue by changing rng_dev_read() to always
 schedule 10 jiffies after release mutex lock, then cat process can
 have chance to get the lock and execute protected code without stuck.

Sorry I'm not going to accept your fix which simply papers over
the problem.

Please bite the bullet and convert this over to RCU.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tun: use netif_receive_skb instead of netif_rx_ni

2014-02-16 Thread Herbert Xu
On Wed, Feb 12, 2014 at 09:50:03AM +0800, Qin Chuanyu wrote:
 On 2014/2/11 23:24, Eric Dumazet wrote:
 diff --git a/drivers/net/tun.c b/drivers/net/tun.c
 index 44c4db8..90b4e58 100644
 --- a/drivers/net/tun.c
 +++ b/drivers/net/tun.c
 @@ -1184,7 +1184,9 @@ static ssize_t tun_get_user(struct tun_struct
 *tun, struct tun_file *tfile,
 skb_probe_transport_header(skb, 0);
 
 rxhash = skb_get_hash(skb);
 -   netif_rx_ni(skb);
 +   rcu_read_lock_bh();
 +   netif_receive_skb(skb);
 +   rcu_read_unlock_bh();
 
 tun-dev-stats.rx_packets++;
 tun-dev-stats.rx_bytes += len;
 
 I already said this patch is not good :
 
 rcu_read_lock_bh() makes no sense here.
 
 What is really needed is local_bh_disable();
 
 Herbert patch ( http://patchwork.ozlabs.org/patch/52963/ ) had a much
 cleaner form.
 
 Just use it, CC him, credit him, please ?
 
 To: Herbert Xu
 I saw that you had delivered a patch to resolve the problem of cgroup.
 patch num is f845172531fb7410c7fb7780b1a6e51ee6df7d52
 
 so would you deliver your patch for tun again? I had test it and
 found it work well.

Feel free to resubmit my patch either as is or with your
modifications as I no longer need it for my original purposes.

Thanks,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio-rng only returns zeros with CONFIG_HW_RANDOM=m

2013-03-10 Thread Herbert Xu
On Wed, Feb 27, 2013 at 11:56:55AM +1030, Rusty Russell wrote:

 OK, I looked at doing a kmalloc and copy in virtio_rng, but it's very
 inelegant (we don't know what size of buffer to allocate).

As the hwrng API allows you to return any number of bytes, you
can just go back to the old virtio-rng code that did the 64-byte
buffer and return a maximum of 64 bytes each time.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] skbuff: clear tx zero-copy flag

2011-07-25 Thread Herbert Xu
On Mon, Jul 25, 2011 at 11:07:43AM +0300, Michael S. Tsirkin wrote:

 However macvtap passes an skb directly to the
 lower device, so as long as macvtap is the only user
 of that interface, we are fine I think - there's
 no way for an skb to get from macvtap to splice
 read path I think.
 
 Right?

Yes, as long as you can guarantee that the skb never loops back
then you should be fine.

However, does macvtap really bypass everything, including the
qdisc layer? The qdisc layer is certainly capable of looping
the skb back with the redirect action.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] skbuff: clear tx zero-copy flag

2011-07-25 Thread Herbert Xu
On Mon, Jul 25, 2011 at 12:44:14PM +0300, Michael S. Tsirkin wrote:

 if yes that seems to always clone an skb, which in turn
 does the copy so we are fine?

Yes you're right, it should be safe.

However, I think we should add a WARN_ON to the splice skb path
so that should a packet find its way through a path that we haven't
thought of then at least we'll know about it.

Thanks,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] skbuff: clear tx zero-copy flag

2011-07-24 Thread Herbert Xu
Shirley Ma mashi...@us.ibm.com wrote:
 
 This patch clears tx zero-copy flag as needed.
 
 Sign-off-by: Shirley Ma x...@us.ibm.com

I think we also need to copy and clear this flag on the splice
read path as that takes a direct page reference.

I hope there isn't any other path that does this.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop

2011-03-18 Thread Herbert Xu
Shirley Ma mashi...@us.ibm.com wrote:

 +   /* Drop packet instead of stop queue for better performance */

I would like to see some justification as to why this is the right
way to go and not just papering over the real problem.

Thanks,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netfilter: add CHECKSUM target

2010-07-11 Thread Herbert Xu
On Sun, Jul 11, 2010 at 12:21:17PM +0300, Michael S. Tsirkin wrote:

  I'd think that this target would be protocol-agnostic, no?
 
 Meaning it should go into net/netfilter/? Will do.

Right, and you should get rid of any IPv4 dependencies.

Thanks,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] netfilter: add CHECKSUM target

2010-07-09 Thread Herbert Xu
On Fri, Jul 09, 2010 at 01:29:13AM +0300, Michael S. Tsirkin wrote:
 This adds a `CHECKSUM' target, which can be used in the iptables mangle
 table.
 
 You can use this target to compute and fill in the checksum in
 an IP packet that lacks a checksum.  This is particularly useful,
 if you need to work around old applications such as dhcp clients,
 that do not work well with checksum offloads, but don't want to
 disable checksum offload in your device.
 
 The problem happens in the field with virtualized applications.
 For reference, see Red Hat bz 60, as well as
 http://www.spinics.net/lists/kvm/msg37660.html
 
 Typical expected use (helps old dhclient binary running in a VM):
 iptables -A POSTROUTING -t mangle -p udp --dport 68 -j CHECKSUM
 --checksum-fill
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com

I'd think that this target would be protocol-agnostic, no?

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-07-03 Thread Herbert Xu
On Mon, Jun 28, 2010 at 05:56:07PM +0800, Xin, Xiaohui wrote:

 OK, if I understand you correctly then I don't think have a
 problem.  With your current patch-set you have exactly the same
 situation when the skb-data is reallocated as a kernel buffer.
 
 When will skb-data to be reallocated? May you point me the code path?

Anything that calls pskb_expand_head.

 This is OK because as you correctly argue, it is a rare situation.
 
 With my proposal you will need to get this extra external buffer
 in even less cases, because you'd only need to do it if the skb
 head grows, which only happens if it becomes encapsulated.
 So let me explain it in a bit more detail:
 
 Our packet starts out as a purely non-linear skb, i.e., skb-head
 contains nothing and all the page frags come from the guest.
 
 During host processing we may pull data into skb-head but the
 first frag will remain unless we pull all of it.  If we did do
 that then you would have a free external buffer anyway.
 
 Now in the common case the header may be modified or pulled, but
 it very rarely grows.  So you can just copy the header back into
 the first frag just before we give it to the guest.
 
 Since the data is still there, so recompute the page offset and size is ok, 
 right?

Right, you just move the page offset back and increase the size.
However, to do this safely we'd need to have a way of knowing
whether the skb head has been modified.

It may well turn out to be just as effective to do something like

if (memcmp(skb-data, page frag head, skb_headlen))
memcpy(page frag head, skb-data, skb_headlen)

As the page frag head should be in cache since it would've been
used to populate skb-data.

It'd be good to run some benchmarks with this to see whether
adding a bit to sk_buff to avoid the memcmp is worth it or not.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-27 Thread Herbert Xu
On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
 
 In current patch, each SKB for the assigned device (SRIOV VF or NIC or a 
 complete queue pairs) uses the buffer from guest, so it eliminates copy 
 completely in software and requires hardware to do so. If we can have an 
 additonal place to store the buffer per skb (may cause copy later on), we can 
 do copy later on or re-post the buffer to assigned NIC driver later on. But 
 that may be not very clean either :(

OK, if I understand you correctly then I don't think have a
problem.  With your current patch-set you have exactly the same
situation when the skb-data is reallocated as a kernel buffer.

This is OK because as you correctly argue, it is a rare situation.

With my proposal you will need to get this extra external buffer
in even less cases, because you'd only need to do it if the skb
head grows, which only happens if it becomes encapsulated.

So let me explain it in a bit more detail:

Our packet starts out as a purely non-linear skb, i.e., skb-head
contains nothing and all the page frags come from the guest.

During host processing we may pull data into skb-head but the
first frag will remain unless we pull all of it.  If we did do
that then you would have a free external buffer anyway.

Now in the common case the header may be modified or pulled, but
it very rarely grows.  So you can just copy the header back into
the first frag just before we give it to the guest.

Only in the case where the packet header grows (e.g., encapsulation)
would you need to get an extra external buffer.

Cheers,
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-24 Thread Herbert Xu
On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
 
 I mean once the frontend side driver post the buffers to the backend driver, 
 the backend driver will immediately use that buffers to compose skb or 
 gro_frags and post them to the assigned host NIC driver as receive buffers. 
 In that case, if the backend driver recieves a packet from the NIC that 
 requires to do copy, it may be unable to find additional free guest buffer 
 because all of them are already used by the NIC driver. We have to reserve 
 some guest buffers for the possible copy even if the buffer address is not 
 identified by original skb :(

OK I see what you mean.  Can you tell me how does Xiaohui's
previous patch-set deal with this problem?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-23 Thread Herbert Xu
On Wed, Jun 23, 2010 at 04:09:40PM +0800, Dong, Eddie wrote:

 Xiaohui  Herbert:
   Mixing copy of head  0-copy of bulk data imposes additional challange 
 to find the guest buffer. The backend driver may be unable to find a spare 
 guest buffer from virtqueue at that time which may block the receiving 
 process then.
   Can't we completely eliminate netdev_alloc_skb here? Assigning guest 
 buffer at this time makes life much easier.

I'm not sure I understand you concern.  If you mean that when
the guest doesn't give enough pages to the host and the host
can't receive on behalf of the guest then isn't that already
the case with the original patch-set?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Herbert Xu
On Sun, Jun 20, 2010 at 01:06:32PM +0300, Michael S. Tsirkin wrote:

 Changing the guest virtio to match the backend is a problem,
 this breaks migration etc.

As long as it's done in a backwards compatible way it should be
fine.  It's just like migrating from a backend that supports TSO
to one that doesn't.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Herbert Xu
On Sun, Jun 20, 2010 at 01:39:09PM +0300, Michael S. Tsirkin wrote:

  It's just like migrating from a backend that supports TSO
  to one that doesn't.
 
 Exactly. We don't support such migration.

Well that's something that has to be addressed in the virtio_net.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Herbert Xu
On Sun, Jun 20, 2010 at 02:11:24PM +0300, Michael S. Tsirkin wrote:

 Rather than modifying all guests, it seems much easier not to assume
 specific buffer layout in host.  Copying network header around seems a
 small cost.

Well sure we can debate the specifics of this implementation detail.

However, the fact that virtio_net doesn't support feature renegotiation
on live migration is not a valid reason against this.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Herbert Xu
On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:

 Let's do this then.  So far the virtio spec avoided making layout
 assumptions, leaving guests lay out data as they see fit.
 Isn't it possible to keep supporting this with zero copy for hardware
 that can issue DMA at arbitrary addresses?

I think you're mistaken with respect to what is being proposed.
Raising 512 bytes isn't a hard constraint, it is merely an
optimisation for Intel NICs because their PS mode can produce
a head fragment of up to 512 bytes.

If the guest didn't allocate 512 bytes it wouldn't be the end of
the world, it'd just mean that we'd either copy whatever is in
the head fragment, or we waste 4096-X bytes of memory where X
is the number of bytes in the head.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-18 Thread Herbert Xu
On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote:

 Herbert,
 I have questions about the idea above:
 1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(), 
 then we don't need napi_gro_frags() any more, the driver's original receiving 
 function is ok. Right?

Well I was actually thinking about converting all drivers that
need this to napi_gro_frags.  But now that you mention it, yes
we could still keep the old interface to minimise the work.

 2) Is napi_gro_frags() only suitable for TCP protocol packet? 
 I have done a small test for ixgbe driver to let it only allocate paged 
 buffers 
 and found kernel hangs when napi_gro_frags() receives an ARP packet.

It should work with any packet.  In fact, I'm pretty sure the
other drivers (e.g., cxgb3) use that interface for all packets.

 3) As I have mentioned above, with this idea, netdev_alloc_skb() will 
 allocate 
 as usual, the data pointed by skb-data will be copied into the first guest 
 buffer. 
 That means we should reserve sufficient room in guest buffer. For PS mode 
 supported driver (for example ixgbe), the room will be more than 128. After 
 128bytes, 
 we will put the first frag data. Look into virtio-net.c the function 
 page_to_skb()  
 and receive_mergeable(), that means we should modify guest virtio-net driver 
 to 
 compute the offset as the parameter for skb_set_frag().
 
 How do you think about this? Attached is a patch to how to modify the guest 
 driver.
 I reserve 512 bytes as an example, and transfer the header len of the skb in 
 hdr-hdr_len.

Expanding the buffer size to 512 bytes to accomodate PS mode
looks reasonable to me.

However, I don't think we should increase the copy threshold to
512 bytes at the same time.  I don't have any figures myself but
I think if we are to make such a change it should be a separate
one and come with supporting numbers.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-18 Thread Herbert Xu
On Fri, Jun 18, 2010 at 03:14:18PM +0800, Xin, Xiaohui wrote:

 Thanks for the verification. By the way, does that mean that nearly all 
 drivers can use the 
 same napi_gro_frags() to receive buffers though currently each driver has 
 it's own receiving 
 function?

There is no reason why the napi_gro_frags can't be used by any
driver that supports receiving data into pages.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-17 Thread Herbert Xu
On Sat, Jun 12, 2010 at 05:31:10PM +0800, Xin, Xiaohui wrote:
 
 1) Modify driver from netdev_alloc_skb() to alloc user pages if dev is 
 zero-copyed.
   If the driver support PS mode, then modify alloc_page() too.

Well if you were doing this then the driver won't be generating
skbs at all.  So you don't need to change netdev_alloc_skb.

The driver would currently do alloc_page, so you would replace
that with netdev_alloc_page, which can call your new function
to allocate an external page where appropriate.

IOW you just need one change in the driver if it already uses
the skbless interface, to replace with alloc_page.

If the driver doesn't use the skbless interface then you need
to make a few more changes but it isn't too hard either, it'll
also mean that the driver will have less overhead even for normal
use which is a win-win situation.

 2) Add napi_gro_frags() in driver to receive the user pages instead of 
 driver's receiving 
 function.

 3) napi_gro_frags() will allocate small skb and pull the header data from 
 the first page to skb-data.
 
 Is above the way what you have suggested?

Yes.

 I have thought something in detail about the way.
 
 1) The first page will have an offset after the header is copied into 
 allocated kernel skb. 
 The offset should be recalculated when the user page data is transferred to 
 guest. This 
 may modify some of the gro code.

We could keep track whether the stack has modified the header,
since you can simply ignore it if it doesn't modify it, which
should be the common case for virt.

 2) napi_gro_frags() may remove a page when it's data is totally be pulled, 
 but we cannot 
 put a user page as normally. This may modify the gro code too.

If it does anything like that, then we're not in the fast-path
case so you can just fall back to copying.

 3) When the user buffer returned to guest, some of them need to be appended a 
 vnet header.
 That means for some pages, the vnet header room should be reserved when 
 allocated.
 But we cannot know which one will be used as the first page when allocated. 
 If we reserved vnet header for each page, since the set_skb_frag() in guest 
 driver only use the offset 0 for second pages, then page data will be wrong.

I don't see why this would be a problem, since as far as what
the driver is putting onto the physical RX ring nothing has
changed.  IOW if you want to designate a certain page as special,
or the first page, you can still do so.

So can you explain which bits of your patches would be affected
by this?

 4) Since the user buffer pages should be released, so we still need a dtor 
 callback to do that, and then I still need a place to hold it. How do you 
 think about to put it in skb_shinfo?

While I don't like that very much I guess I can live with that
if nobody else objects.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-17 Thread Herbert Xu
On Sun, Jun 13, 2010 at 04:58:36PM +0800, Xin, Xiaohui wrote:
 
 Herbert,
 In this way, I think we should create 3 functions at least in drivers to 
 allocate rx buffer, to receive the rx buffers, and to clean the rx buffers.
 
 We can also have another way here. We can provide a function to only 
 substitute 
 alloc_page(), and a function to release the pages when cleaning the rx 
 buffers.

Yes that's exactly what I had in mind.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-07 Thread Herbert Xu
On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote:
 Still not sure this is a good idea for a couple of reasons:
 
 1. We already have lots of special cases with skb's (frags and fraglist),
and skb's travel through a lot of different parts of the kernel.  So any
new change like this creates lots of exposed points for new bugs. Look
at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
and ppp and ...
 
 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
a fixed size pool in an external device, they can easily all get tied up
if you have a slow listener. What happens then?

I agree with Stephen on this.

FWIW I don't think we even need the external pages concept in
order to implement zero-copy receive (which I gather is the intent
here).

Here is one way to do it, simply construct a completely non-linear
packet in the driver, as you would if you were using the GRO frags
interface (grep for napi_gro_frags under drivers/net for examples).

This way you can transfer the entire contents of the packet without
copying through to the other side, provided that the host stack does
not modify the packet.

If the host side did modify the packet then we have to incur the
memory cost anyway.

IOW I think the only feature provided by the external pages
construct is allowing the skb-head area to be shared without
copying.  I'm claiming that this can be done by simply making
skb-head empty.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: network shutdown under heavy load

2010-01-10 Thread Herbert Xu
On Sun, Jan 10, 2010 at 02:30:12PM +0200, Avi Kivity wrote:

 This isn't in 2.6.27.y.  Herbert, can you send it there?

It appears that now that TX is fixed we have a similar problem
with RX.  Once I figure that one out I'll send them together.

Who is maintaining that BTW, sta...@kernel.org?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: network shutdown under heavy load

2009-12-17 Thread Herbert Xu
On Thu, Dec 17, 2009 at 01:15:46PM -0500, rek2 wrote:

 I been told that today the network when down again and one of the guys  
 here had to log using the console and restart it for that particular  
 guests..

 on the guest:
  uname -a
 Linux  2.6.27.25-170.2.72.fc10.x86_64 #1 SMP Sun Jun 21 18:39:34 EDT  
 2009 x86_64 x86_64 x86_64 GNU/Linux

 Next time it goes down I will try to run a sniffer and try both sides.

OK I'm fairly sure this version has a buggy virtio-net.  Does
this patch (if it applies :) help?

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9eec5a5..74b3854 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -521,8 +521,10 @@ static void xmit_tasklet(unsigned long data)
vi-svq-vq_ops-kick(vi-svq);
vi-last_xmit_skb = NULL;
}
-   if (vi-free_in_tasklet)
+   if (vi-free_in_tasklet) {
free_old_xmit_skbs(vi);
+   netif_wake_queue(vi-dev);
+   }
netif_tx_unlock_bh(vi-dev);
 }
 
Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: network shutdown under heavy load

2009-12-16 Thread Herbert Xu
On Wed, Dec 16, 2009 at 02:17:04PM +0200, Avi Kivity wrote:
 On 12/14/2009 05:49 PM, rek2 wrote:
 Hello, we notice that when we stress any of our guests, in this case  
 they are all fedora, the KVM network will shutdown.. anyone experience  
 this?

 Herbert?

What's the exact guest kernel version? When the network is down,
please get onto the guest console to determine which direction
(if not both) of the network is not functioning.

You can run tcpdump in the guest/host and execute pings on both
sides to see which direction is blocked.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources

2009-06-10 Thread Herbert Xu
On Tue, Jun 09, 2009 at 10:51:23PM -0700, Sridhar Samudrala wrote:

 Unfortunately, this doesn't work for UDP without any changes.  
 skb_segment() currently adds transport header to
 each segmented skb. But when we are using IP fragmentation, only the  
 first fragment should include the UDP
 header.
 We either need to fix skb_segment() to handle IP fragmentation or write  
 a new skb_fragment(). I will look into
 this when i get some time.

Couldn't you get around this by setting skb-transport_header to
skb-network_header before getting into skb_segment?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources

2009-06-07 Thread Herbert Xu
On Fri, Jun 05, 2009 at 05:16:31PM -0700, Sridhar Samudrala wrote:

 + /* Software UFO is not yet supported */
 + segs = ERR_PTR(-EPROTONOSUPPORT);

Hmm, we need to fill this in before you start using it for virt.
After all, it's very difficult for the guest to know whether the
output device on the host is going to be able to do UFO or not.

The whole reason I did GSO in the first place is to allow the guest
not to worry about whether the host supported TSO in hardware.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-03 Thread Herbert Xu
On Fri, Apr 03, 2009 at 01:18:54PM +0200, Andi Kleen wrote:
  Check shared ring status when stuffing a request.  If there are requests
 
 That means you're bouncing cache lines all the time. Probably not a big
 issue on single socket but could be on larger systems.

If the backend is running on a core that doesn't share caches
with the guest queue then you've got bigger problems.

Right this is unavoidable for guests with many CPUs but that
should go away once we support multiqueue in virtio-net.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-03 Thread Herbert Xu
On Fri, Apr 03, 2009 at 02:46:04PM +0300, Avi Kivity wrote:

 The host writes the packet to tap, at which point it is consumed from  
 its point of view.  The host would like to mention that if there was an  
 API to notify it when the packet was actually consumed, then it would  
 gladly use it.  Bonus points if this involves not copying the packet.

We're using write(2) for this, no? That should invoke netif_rx_ni
which blocks until the packet is processed, which usually means
that it's placed on the NIC's hardware queue.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
Avi Kivity a...@redhat.com wrote:

 virtio is already non-kvm-specific (lguest uses it) and non-pci-specific 
 (s390 uses it).

I think Greg's work shows that putting the backend in the kernel
can dramatically reduce the cost of a single guest-host transaction.
I'm sure the same thing would work for virtio too.

 If you have a good exit mitigation scheme you can cut exits by a factor 
 of 100; so the userspace exit costs are cut by the same factor.  If you 
 have good copyless networking APIs you can cut the cost of copies to 
 zero (well, to the cost of get_user_pages_fast(), but a kernel solution 
 needs that too).

Given the choice of having to mitigate or not having the problem
in the first place, guess what I would prefer :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 09:46:49AM +0300, Avi Kivity wrote:

 I don't understand this.  If we had good interfaces, all that userspace  
 would do is translate guest physical addresses to host physical  
 addresses, and translate the guest-host protocol to host API calls.  I  
 don't see anything there that benefits from being in the kernel.

 Can you elaborate?

I think Greg has expressed it clearly enough.

At the end of the day, the numbers speak for themselves.  So if
and when there's a user-space version that achieves the same or
better results, then I will change my mind :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 12:02:09PM +0300, Avi Kivity wrote:

 There is no choice.  Exiting from the guest to the kernel to userspace  
 is prohibitively expensive, you can't do that on every packet.

I was referring to the bit between the kernel and userspace.

In any case, I just looked at the virtio mitigation code again
and I am completely baffled at why we need it.  Look at Greg's
code or the netback/netfront notification, why do we need this
completely artificial mitigation when the ring itself provides
a natural way of stemming the flow?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote:

 If tap told us when the packets were actually transmitted, life would be  
 wonderful:

And why do we need this? Because we are in user space!

I'll continue to wait for your patch and numbers :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 05:29:36PM +0800, Herbert Xu wrote:
 On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote:
 
  If tap told us when the packets were actually transmitted, life would be  
  wonderful:
 
 And why do we need this? Because we are in user space!
 
 I'll continue to wait for your patch and numbers :)

And in case you're working on that patch, this might interest
you.  Check out the netdev thread titled TX time stamping.
Now that we assign the tap skb with its own sk, these two scenarios
are pretty much identical.

I also noitced despite davem's threats to revert the patch, it
has now made Linus's tree :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 12:38:46PM +0300, Avi Kivity wrote:

 Why does a kernel solution not need to know when a packet is transmitted?

Because you can install your own destructor?

I don't know what Greg did, but netback did that nasty page destructor
hack which Jeremy is trying to undo :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 12:43:54PM +0300, Avi Kivity wrote:

 So we're back to the problem is with the kernel-user interface, not  
 userspace being cursed into slowness.

Well until you have a patch + numbers that's only an allegation :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 12:03:32PM +0300, Avi Kivity wrote:

 Like Anthony said, the problem is with the kernel-user interfaces.  We  
 won't have a good user space virtio implementation until that is fixed.

If it's just the interface that's bad, then it should be possible
to do a proof-of-concept patch to show that this is the case.

Even if we have to redesign the interface, at least you can then
say that you guys were right all along :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Thu, Apr 02, 2009 at 07:54:21PM +0300, Avi Kivity wrote:

 3ms latency for ping?

 (ping will always be scheduled immediately when the reply arrives if I  
 understand cfs, so guest load won't delay it)

That only happens if the guest immediately does some CPU-intensive
computation 3ms and assuming its timeslice lasts that long.

In any case, the same thing will happen right now if the host or
some other guest on the same CPU hogs the CPU for 3ms.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
On Fri, Apr 03, 2009 at 01:06:10AM +0800, Herbert Xu wrote:

 That only happens if the guest immediately does some CPU-intensive
 computation 3ms and assuming its timeslice lasts that long.
 
 In any case, the same thing will happen right now if the host or
 some other guest on the same CPU hogs the CPU for 3ms.

Even better, look at the packet's TOS.  If it's marked for low-
latency then vmexit immediately.  Otherwise continue.

In the backend you'd just set the marker in shared memory.

Of course invert this for the host = guest direction.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-02 Thread Herbert Xu
Anthony Liguori anth...@codemonkey.ws wrote:

 Anyway, if we're able to send this many packets, I suspect we'll be able 
 to also handle much higher throughputs without TX mitigation so that's 
 what I'm going to look at now.

Awesome! I'm prepared to eat my words :)

On the subject of TX mitigation, can we please set a standard
on how we measure it? For instance, do we bind the the backend
qemu to the same CPU as the guest, or do we bind it to a different
CPU that shares cache? They're two completely different scenarios
and I think we should be explicit about which one we're measuring.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Herbert Xu
Rusty Russell ru...@rustcorp.com.au wrote:
 
 As you point out, 350-450 is possible, which is still bad, and it's at least
 partially caused by the exit to userspace and two system calls.  If virtio_net
 had a backend in the kernel, we'd be able to compare numbers properly.

FWIW I don't really care whether we go with this or a kernel
virtio_net backend.  Either way should be good.  However the
status quo where we're stuck with a user-space backend really
sucks!

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Herbert Xu
Anthony Liguori anth...@codemonkey.ws wrote:

 That said, I don't think we're bound today by the fact that we're in 
 userspace.  Rather we're bound by the interfaces we have between the 
 host kernel and userspace to generate IO.  I'd rather fix those 
 interfaces than put more stuff in the kernel.

I'm sorry but I totally disagree with that.  By having our IO
infrastructure in user-space we've basically given up the main
advantage of kvm, which is that the physical drivers operate in
the same environment as the hypervisor.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Herbert Xu
Chris Wright chr...@sous-sol.org wrote:

 That said, I don't think we're bound today by the fact that we're in  
 userspace.  Rather we're bound by the interfaces we have between the  
 host kernel and userspace to generate IO.  I'd rather fix those  
 interfaces than put more stuff in the kernel.
 
 And more stuff in the kernel can come at the potential cost of weakening
 protection/isolation.

Protection/isolation always comes at a cost.  Not everyone wants
to pay that, just like health insurance :) We should enable the
users to choose which model they want, based on their needs.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: copyless virtio net thoughts?

2009-02-18 Thread Herbert Xu
On Wed, Feb 18, 2009 at 10:08:00PM +1030, Rusty Russell wrote:

 4) Multiple queues
 This is Herbert's.  Should be fairly simple to add; it was in the back of my 
 mind when we started.  Not sure whether the queues should be static or 
 dynamic (imagine direct interguest networking, one queue pair for each other 
 guest), and how xmit queues would be selected by the guest (anything 
 anywhere, or dst mac?).

The primary purpose of multiple queues is to maximise CPU utilisation,
so the number of queues is simply dependent on the number of CPUs
allotted to the guest.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: copyless virtio net thoughts?

2009-02-05 Thread Herbert Xu
On Thu, Feb 05, 2009 at 02:37:07PM +0200, Avi Kivity wrote:

 I believe that copyless networking is absolutely essential.

I used to think it was important, but I'm now of the opinion
that it's quite useless for virtualisation as it stands.

 For transmit, copyless is needed to properly support sendfile() type  
 workloads - http/ftp/nfs serving.  These are usually high-bandwidth,  
 cache-cold workloads where a copy is most expensive.

This is totally true for baremetal, but useless for virtualisation
right now because the block layer is not zero-copy.  That is, the
data is going to be cache hot anyway so zero-copy networking doesn't
buy you much at all.

Please also recall that for the time being, block speeds are
way slower than network speeds.  So the really interesting case
is actually network-to-network transfers.  Again due to the
RX copy this is going to be cache hot.

 For receive, the guest will almost always do an additional copy, but it  
 will most likely do the copy from another cpu.  Xen netchannel2  

That's what we should strive to avoid.  The best scenario with
modern 10GbE NICs is to stay on one CPU if at all possible.  The
NIC will pick a CPU when it delivers the packet into one of the
RX queues and we should stick with it for as long as possible.

So what I'd like to see next in virtualised networking is virtual
multiqueue support in guest drivers.  No I'm not talking about
making one or more of the physical RX/TX queues available to the
guest (aka passthrough), but actually turning something like the
virtio-net interface into a multiqueue interface.

This is the best way to get cache locality and minimise CPU waste.

So I'm certainly not rushing out to do any zero-copy virtual
networking.  However, I would like to start working on a virtual
multiqueue NIC interface.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] virtio_net: Add a MAC filter table

2009-01-28 Thread Herbert Xu
On Thu, Jan 29, 2009 at 10:25:46AM +1030, Rusty Russell wrote:

 So, this conversation has convinced me that the host should
 accept arbitrary filtering entries, and the guest should accept
 that it is best effort.

I agree with this completely.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Work around dhclient brokenness

2008-08-24 Thread Herbert Xu
On Fri, Aug 15, 2008 at 02:47:12PM -0500, Anthony Liguori wrote:

 +static void work_around_broken_dhclient(struct virtio_net_hdr *hdr,
 +const uint8_t *buf, size_t size)
 +{
 +if ((hdr-flags  VIRTIO_NET_HDR_F_NEEDS_CSUM)  /* missing csum */
 +(size  18  size  1500)  /* normal sized MTU */
 +(buf[12] == 0x08  buf[13] == 0x00)  /* ethertype == IPv4 */
 +(buf[23] == 17)) { /* ip.protocol == UDP */
 +/* FIXME this cast is evil */
 +net_checksum_calculate((uint8_t *)buf, size);

If we're going to do this, how about just setting the checksum
to zero?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Work around dhclient brokenness

2008-08-19 Thread Herbert Xu
On Tue, Aug 19, 2008 at 07:08:23PM +1000, Rusty Russell wrote:
 
 Not really.  We could extend the protocol, but that's currently how feature 
 negotiation works: you can't do it while the device is live.  That seemed 
 simplest.  I learnt from Xen :)
 
 (Of course, we don't need to *disable* it, we need to *enable* it).

I don't see why we shouldn't support disabling it.  After all, any
NIC that supports receive checksum offload allows it to be disabled.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Work around dhclient brokenness

2008-08-18 Thread Herbert Xu
On Mon, Aug 18, 2008 at 02:40:55PM +0300, Avi Kivity wrote:

 Isn't that turned on automatically for real hardware?  And what's to 
 prevent a broken dhclient together with the (presumably) hacked up 
 initscripts that call ethtool?

Well the idea is that only a fixed guest would even know about
enabling this.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Work around dhclient brokenness

2008-08-18 Thread Herbert Xu
On Mon, Aug 18, 2008 at 10:13:55PM -0700, Chris Wedgwood wrote:
 
 CSUM2 sounds so ugly though.  Features seem to get added and never
 removed  how about if this had a documented short lifetime (if it
 really must go in)?

All we need is a simple toggle to disable checksum offload.  Every
NIC that offers receive checksum offload allows it to be disabled.
virtio shouldn't be any different.

Resetting the NIC seems a bit over the top.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/12] virtio_net perf patches

2008-08-12 Thread Herbert Xu
On Tue, Aug 12, 2008 at 07:12:28PM +0100, Mark McLoughlin wrote:

 Are you basically just saying that guests with a broken dhclient
 should manually disable rx checksum offload with ethtool? And that the
 host should react by disabling tx offload on the tap interfacE?

No I'm saying that everybody should default to no checksum offload.
Those that have working kernels + user-space can then enable it on
boot-up.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] Fix partial csum rx handling

2008-05-29 Thread Herbert Xu
On Tue, May 27, 2008 at 12:36:06PM +0100, Mark McLoughlin wrote:
 When we receive a packet with a partial csum,
 just set ip_summed to CHECKSUM_UNNECESSARY.
 
 Signed-off-by: Mark McLoughlin [EMAIL PROTECTED]
 Cc: Herbert Xu [EMAIL PROTECTED]

I'm not sure I follow the context of this patch.  Could you
explain what it's for?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html