Re: [PATCH 1/1] tun: TUNGETIFF interface to query name and flags

2008-08-15 Thread Max Krasnyansky
Mark McLoughlin wrote:
 On Thu, 2008-08-14 at 11:58 +1000, Rusty Russell wrote:
 On Thursday 14 August 2008 00:30:16 Mark McLoughlin wrote:
 A very simple approach is attached; I did consider doing a TUNGETFLAGS
 that would return tun-flags, but I think it's nicer to have a companion
 to TUNGETIFF since it also allows one to query the interface name from
 the file descriptor.
 This seems really sensible to me.

 If Max acks it,
 
 How about it Max?

Sorry for the delay Mark. I replied to your other email.
ACK.

Max
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] stopmachine: add stopmachine_timeout

2008-07-14 Thread Max Krasnyansky


Heiko Carstens wrote:
 On Mon, Jul 14, 2008 at 11:56:18AM -0700, Jeremy Fitzhardinge wrote:
 Rusty Russell wrote:
 On Monday 14 July 2008 21:51:25 Christian Borntraeger wrote:
 Am Montag, 14. Juli 2008 schrieb Hidetoshi Seto:
 
 + /* Wait all others come to life */
 + while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
 + if (time_is_before_jiffies(limit))
 + goto timeout;
 + cpu_relax();
 + }
 +
   
 Hmm. I think this could become interesting on virtual machines. The
 hypervisor might be to busy to schedule a specific cpu at certain load
 scenarios. This would cause a failure even if the cpu is not really locked
 up. We had similar problems with the soft lockup daemon on s390.
 5 seconds is a fairly long time.  If all else fails we could have a config 
 option to simply disable this code.
 
 Hmm.. probably a stupid question: but what could happen that a real cpu
 (not virtual) becomes unresponsive so that it won't schedule a MAX_RT_PRIO-1
 prioritized task for 5 seconds?
I have a workload where MAX_PRIO RT thread runs and never yeilds. That's what
my cpu isolation patches/tree addresses. Stopmachine is the only (that I know
of) thing that really brakes in that case. btw In case you're wondering yes
we've discussed workqueue threads starvation and stuff in the other threads.
So yet it can happen.

 It would be good to not-use wall-clock time, but really used cpu time
 instead. Unfortunately I have no idea, if that is possible in a generic
 way. Heiko, any ideas?
 Ah, cpu time comes up again.  Perhaps we should actually dig that up again; 
 Zach and Jeremy CC'd.
 Hm, yeah. But in this case, it's tricky. CPU time is an inherently
 per-cpu quantity. If cpu A is waiting for cpu B, and wants to do the
 timeout in cpu-seconds, then it has to be in *B*s cpu-seconds (and if A
 is waiting on B,C,D,E,F... it needs to measure separate timeouts with
 separate timebases for each other CPU). It also means that if B is
 unresponsive but also not consuming any time (blocked in IO,
 administratively paused, etc), then the timeout will never trigger.

 So I think monotonic wallclock time actually makes the most sense here.
 
 This is asking for trouble... a config option to disable this would be
 nice. But as I don't know which problem this patch originally addresses
 it might be that this is needed anyway. So lets see why we need it first.
How about this. We'll make this a sysctl, as Rusty already did, and set the
default to 0 which means never timeout. That way crazy people like me who
care about this scenario can enable this feature.

btw Rusty, I just had this why didn't I think of that moments. This is
actually another way of handling my workload. I mean it certainly does not fix
the root case of the problems and we still need other things that we talked
about (non-blocking module delete, lock-free module insertion, etc) but at
least in the mean time it avoids wedging the machines for good.
btw I'd like that timeout in milliseconds. I think 5 seconds is way to
long :).

Max
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] stopmachine: add stopmachine_timeout

2008-07-14 Thread Max Krasnyansky


Hidetoshi Seto wrote:
 Heiko Carstens wrote:
 Hmm.. probably a stupid question: but what could happen that a real cpu
 (not virtual) becomes unresponsive so that it won't schedule a MAX_RT_PRIO-1
 prioritized task for 5 seconds?
 
 The original problem (once I heard and easily reproduced) was there was an
 another MAX_RT_PRIO-1 task and the task was spinning in itself by a bug.
 (Now this would not be a problem since RLIMIT_RTTIME will work for it, but
 I cannot deny that there are some situations which cannot set the limit.)
Yep. As I described in the prev email in my case it's a legitimate thing. Some
of the CPU cores are running wireless basestation schedulers and the deadlines
are way too tight for them to sleep (it's cpu as a dedicated engine kind of
thing, they are properly isolated and stuff).

In this case actually RT limit is the first thing that I disable :).
I'd rather have stop_machine fail and tell the user that something is wrong.
In which case they can simply stop the basestation app that is running when
convinient. ie It give control back to the user rather than wedging the box or
killing the app.

Max
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] tun: Fix/rewrite packet filtering logic

2008-07-12 Thread Max Krasnyansky
Please see the following thread to get some context on this
http://marc.info/?l=linux-netdevm=121564433018903w=2

Basically the issue is that current multi-cast filtering stuff in
the TUN/TAP driver is seriously broken.
Original patch went in without proper review and ACK. It was broken and
confusing to start with and subsequent patches broke it completely.
To give you an idea of what's broken here are some of the issues:

- Very confusing comments throughout the code that imply that the
character device is a network interface in its own right, and that packets
are passed between the two nics. Which is completely wrong.

- Wrong set of ioctls is used for setting up filters. They look like
shortcuts for manipulating state of the tun/tap network interface but
in reality manipulate the state of the TX filter.

- ioctls that were originally used for setting address of the the TX filter
got fixed and now set the address of the network interface itself. Which
made filter totaly useless.

- Filtering is done too late. Instead of filtering early on, to avoid
unnecessary wakeups, filtering is done in the read() call.

The list goes on and on :)

So the patch cleans all that up. It introduces simple and clean interface for
setting up TX filters (TUNSETTXFILTER + tun_filter spec) and does filtering
before enqueuing the packets.

TX filtering is useful in the scenarios where TAP is part of a bridge, in
which case it gets all broadcast, multicast and potentially other packets when
the bridge is learning. So for example Ethernet tunnelling app may want to
setup TX filters to avoid tunnelling multicast traffic. QEMU and other
hypervisors can push RX filtering that is currently done in the guest into the
host context therefore saving wakeups and unnecessary data transfer.

This is on top of the latest and greatest :). Assuming virt folks are ok with
the API this should go into 2.6.27.

Signed-off-by: Max Krasnyansky [EMAIL PROTECTED]
---
 drivers/net/tun.c  |  316 +++-
 include/linux/if_tun.h |   24 +++-
 2 files changed, 174 insertions(+), 166 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2693f88..901551c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -18,15 +18,11 @@
 /*
  *  Changes:
  *
- *  Brian Braunstein [EMAIL PROTECTED] 2007/03/23
- *Fixed hw address handling.  Now net_device.dev_addr is kept consistent
- *with tun.dev_addr when the address is set by this module.
- *
  *  Mike Kershaw [EMAIL PROTECTED] 2005/08/14
  *Add TUNSETLINK ioctl to set the link encapsulation
  *
  *  Mark Smith [EMAIL PROTECTED]
- *   Use random_ether_addr() for tap MAC address.
+ *Use random_ether_addr() for tap MAC address.
  *
  *  Harald Roelle [EMAIL PROTECTED]  2004/04/20
  *Fixes in packet dropping, queue length setting and queue wakeup.
@@ -83,9 +79,16 @@ static int debug;
 #define DBG1( a... )
 #endif
 
+#define FLT_EXACT_COUNT 8
+struct tap_filter {
+   unsigned intcount;/* Number of addrs. Zero means disabled */
+   u32 mask[2];  /* Mask of the hashed addrs */
+   unsigned char   addr[FLT_EXACT_COUNT][ETH_ALEN];
+};
+
 struct tun_struct {
struct list_headlist;
-   unsigned long   flags;
+   unsigned intflags;
int attached;
uid_t   owner;
gid_t   group;
@@ -94,19 +97,119 @@ struct tun_struct {
struct sk_buff_head readq;
 
struct net_device   *dev;
+   struct fasync_struct*fasync;
 
-   struct fasync_struct*fasync;
-
-   unsigned long if_flags;
-   u8 dev_addr[ETH_ALEN];
-   u32 chr_filter[2];
-   u32 net_filter[2];
+   struct tap_filter   txflt;
 
 #ifdef TUN_DEBUG
int debug;
 #endif
 };
 
+/* TAP filterting */
+static void addr_hash_set(u32 *mask, const u8 *addr)
+{
+   int n = ether_crc(ETH_ALEN, addr)  26;
+   mask[n  5] |= (1  (n  31));
+}
+
+static unsigned int addr_hash_test(const u32 *mask, const u8 *addr)
+{
+   int n = ether_crc(ETH_ALEN, addr)  26;
+   return mask[n  5]  (1  (n  31));
+}
+
+static int update_filter(struct tap_filter *filter, void __user *arg)
+{
+   struct { u8 u[ETH_ALEN]; } *addr;
+   struct tun_filter uf;
+   int err, alen, n, nexact;
+
+   if (copy_from_user(uf, arg, sizeof(uf)))
+   return -EFAULT;
+
+   if (!uf.count) {
+   /* Disabled */
+   filter-count = 0;
+   return 0;
+   }
+
+   alen = ETH_ALEN * uf.count;
+   addr = kmalloc(alen, GFP_KERNEL);
+   if (!addr)
+   return -ENOMEM;
+
+   if (copy_from_user(addr, arg + sizeof(uf), alen)) {
+   err = -EFAULT;
+   goto done;
+   }
+
+   /* The filter is updated without holding any locks. Which is
+* perfectly safe. We disable it first and in the worst

Re: Multicast and receive filtering in TUN/TAP

2008-07-11 Thread Max Krasnyansky
Rusty Russell wrote:
 On Friday 11 July 2008 12:20:07 Max Krasnyansky wrote:
 I haven't looked at the virtio stuff much, I was assuming that the host
 side of it is still the TUN driver. Is it not ?
 Yes, the host side is still tun/tap. The problem is that qemu doesnt know
 which multicast addresses are used inside the guest.
 Ah, now I see what you meant by virtio_net does not do multicast. I guess
 it should trivial to add. Rusty will clarify it I guess.
 
 Yes, it could certainly be added; that's what feature bits are for :)

Sounds good.
I'll send the patch that lets you guys setup tx filters on the TAP devices.
Hypervisors will then need to translate rx filters set by the guest OS into
TAP tx filters. I'm thinking of doing it just like E1000 for example. 14 exact
filters and the rest is hashed.

Max
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Multicast and receive filtering in TUN/TAP

2008-07-10 Thread Max Krasnyansky
Christian Borntraeger wrote:
 Am Donnerstag, 10. Juli 2008 schrieb Max Krasnyansky:
 [...]
 The second question is do you guys think that QEMU/KVM/LGUEST/etc would 
 benefit if receive filtering was done by the host OS. Here is a specific 
 example of what I'm talking about.
 We can do what qemu/hw/e1000.c:receive_filter() does in the _host_ 
 context (that function currently runs in the guest context). By looking 
 at libvirt, typical QEMU based setup is that you have a single bridge 
 and all the TAPs from different VMs are hooked up to that bridge. What 
 that means is that if one VM is getting MC traffic or when the bridge 
 sees MACADDR that is not in its tables the packets get delivered to all 
 the VMs. ie We have to wake all of the up only to so that they could 
 drop that packet. Instead, we could setup filters in the host's side of 
 the TAP device.
 Does that sound like something useful for QEMU/KVM ?
 If yes we can talk about the API. If not then I'll just nuke it.
 
 Max,
 
 I know that on s390 the shared OSA network card have multicast filter 
 capabilities. So I guess it is worthwile for a virtualization environments 
 with lots of guests. I also think, that this kind of filtering should be 
 straightforward to implement with the qemu e1000 code. Qemu already knows the 
 multicast addresses.
Sure. It's straightforward to do inside QEMU, and it's already doing it.
The question is should we do it in the host context instead and avoid some
wakeups.

 Thing is, we are heading towards virtio. 
Even for Windows ?

 Unfortunately, virtio_net currently  does not offer a method to register 
 multicast addresses.
I haven't looked at the virtio stuff much, I was assuming that the host side
of it is still the TUN driver. Is it not ?

Max
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Multicast and receive filtering in TUN/TAP

2008-07-10 Thread Max Krasnyansky


Brian Braunstein wrote:
 Sorry that I was confused here and it seems I am still confused.
 
 I was thinking that for any one instance of a TAP interface, there
 should be only 1 MAC address, since there is only 1 network interface,
 since the character device is not a network interface but rather the
 interface for the application to send and receive on that virtual
 network interface.
 
Exactly. Your understanding is perfectly correct.
See my previous reply. It should clear up all the confusion.


 For the MC stuff, I have to admit I haven't looked into it much, but it
 seems like the basic operation of setting the MAC address of the network
 interface should be supported, and it seems like an ioctl called
 SIOCSIFHWADDR should Set the InterFace HardWare ADDRess.  Sorry if I was
 wrong about this.  It might be good to add a comment to SIOCSIFHWADDR
 that says This does not actually set the network interface hardware
 address, this is for multicast filtering or whatever it actually is
 suppose to do.  Or perhaps create a new ioctl that has something about
 multicast filtering in the name, and leave SIOCSIFHWADDR doing what it
 is doing now.
Yep. That's what I'm going to do (ie a different ioctl). Again see my prev
email. We're totally on the same page :).

Max







 
 brian
 
 
 On Thu, Jul 10, 2008 at 2:38 PM, Shaun Jackman [EMAIL PROTECTED]
 mailto:[EMAIL PROTECTED] wrote:
 
 Hi Max,
 
 The original patch implemented receive multicast filtering by
 emulating the implementation used by many physical Ethernet
 interfaces: hashing the multicast address. TUN emulates two network
 cards (and communication via the virtual link between them), the guest
 and the host, or the character device and the network device, so there
 are two receive filters: chr_filter and net_filter. I implemented the
 filtering at the character device using chr_filter in tun_chr_readv,
 and left filtering at the network device for someone else to
 implement.
 
 I'm not sure what you mean by TX filtering. Multicast filtering is
 implemented uniquely at the receiver. There are, however, two
 receivers: the character device and the network device.
 
 I believe Brian's patch was mistaken. Two entirely distinct Ethernet
 addresses are required: one for the character device and one for the
 network device, or put another way, one for the virtual Ethernet
 interface at the guest and one for the virtual Ethernet interface at
 the host. For the same reason, there are two distinct multicast
 filters.
 
 
 
 Looking over the original patch, I believe I see a bug in
 tun_net_mclist:
 memset(tun-chr_filter, 0, sizeof tun-chr_filter);
 should be
 memset(tun-net_filter, 0, sizeof tun-net_filter);
 
 Cheers,
 Shaun
 
 On Wed, Jul 9, 2008 at 3:58 PM, Max Krasnyansky [EMAIL PROTECTED]
 mailto:[EMAIL PROTECTED] wrote:
  Yesterday while fixing xoff stuckiness issue in the TUN/TAP driver
 I got a
  chance to look into the multicast filtering code in there. And
 immediately
  realized how terribly broken  confusing it is. The patch was
 originally
  done by Shaun (CC'ed) and went in without any proper ACK from me,
 Dave or
  Jeff.
  Here is the original ref
 http://marc.info/?l=linux-netdevm=110490502102308w=2
 http://marc.info/?l=linux-netdevm=110490502102308w=2
 
  I'm not going to dive into too much details on what's wrong with
 the current
  code. The main issues are that it mixes RX and TX filtering which are
  orthogonal, and it reuses ioctl names and stuff for manipulating
 TX filter
  state as if it was a normal RX multicast state.
  Later on Brian's patch added insult to the injury
 http://git.kernel.org/?p=linux/kernel/git/\
 http://git.kernel.org/?p=linux/kernel/git/%5C
 torvalds/linux-2.6.git;\
 a=commit;h=36226a8ded46b89a94f9de5976f554bb5e02d84c
  Brian missed the point of the original patch (not his fault, as I
 said the
  original patch was not the best) that the separate address
 introduced by the
  MC patch was used for filtering _TX_ packets. It had nothing to do
 with the
  HW addr of the local network interface.
 
  The problem is that MC stuff is now even more broken and ioctls
 that were
  used originally now mean something different. So my first thinking
 was to
  just rip the MC stuff out because it's broken and probably nobody
 uses it
  (given that we got no complains after Brian's patch broke it
 completely).
  But then I realized that if done properly it might be very useful for
  virtualization.
 
  ---
 
  So the first question is are there any users out there that ever
 used the
  original patch. Shaun, any insight ? How did you intend to use

Multicast and receive filtering in TUN/TAP

2008-07-09 Thread Max Krasnyansky
Yesterday while fixing xoff stuckiness issue in the TUN/TAP driver I got 
a chance to look into the multicast filtering code in there. And 
immediately realized how terribly broken  confusing it is. The patch 
was originally done by Shaun (CC'ed) and went in without any proper ACK 
from me, Dave or Jeff.
Here is the original ref
http://marc.info/?l=linux-netdevm=110490502102308w=2

I'm not going to dive into too much details on what's wrong with the 
current code. The main issues are that it mixes RX and TX filtering 
which are orthogonal, and it reuses ioctl names and stuff for 
manipulating TX filter state as if it was a normal RX multicast state.
Later on Brian's patch added insult to the injury
http://git.kernel.org/?p=linux/kernel/git/\
torvalds/linux-2.6.git;\
a=commit;h=36226a8ded46b89a94f9de5976f554bb5e02d84c
Brian missed the point of the original patch (not his fault, as I said 
the original patch was not the best) that the separate address 
introduced by the MC patch was used for filtering _TX_ packets. It had 
nothing to do with the HW addr of the local network interface.

The problem is that MC stuff is now even more broken and ioctls that 
were used originally now mean something different. So my first thinking 
was to just rip the MC stuff out because it's broken and probably nobody 
uses it (given that we got no complains after Brian's patch broke it 
completely). But then I realized that if done properly it might be very 
useful for virtualization.

---

So the first question is are there any users out there that ever used 
the original patch. Shaun, any insight ? How did you intend to use it ?

---

The second question is do you guys think that QEMU/KVM/LGUEST/etc would 
benefit if receive filtering was done by the host OS. Here is a specific 
example of what I'm talking about.
We can do what qemu/hw/e1000.c:receive_filter() does in the _host_ 
context (that function currently runs in the guest context). By looking 
at libvirt, typical QEMU based setup is that you have a single bridge 
and all the TAPs from different VMs are hooked up to that bridge. What 
that means is that if one VM is getting MC traffic or when the bridge 
sees MACADDR that is not in its tables the packets get delivered to all 
the VMs. ie We have to wake all of the up only to so that they could 
drop that packet. Instead, we could setup filters in the host's side of 
the TAP device.
Does that sound like something useful for QEMU/KVM ?
If yes we can talk about the API. If not then I'll just nuke it.

Thanx
Max
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/4] tun: TUNSETFEATURES to set gso features.

2008-07-01 Thread Max Krasnyansky


Rusty Russell wrote:
 ethtool is useful for setting (some) device fields, but it's
 root-only.  Finer feature control is available through a tun-specific
 ioctl.
 
 (Includes Mark McLoughlin [EMAIL PROTECTED]'s fix to hold rtnl sem).
 
 Signed-off-by: Rusty Russell [EMAIL PROTECTED]


Agree. And it's often way more convenient to muck with the TUN settings
directly on the fd instead of using external tool.

Acked-by: Max Krasnyansky [EMAIL PROTECTED]
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/4] tun: Interface to query tun/tap features.

2008-07-01 Thread Max Krasnyansky

Rusty Russell wrote:
 The problem with introducing checksum offload and gso to tun is they
 need to set dev-features to enable GSO and/or checksumming, which is
 supposed to be done before register_netdevice(), ie. as part of
 TUNSETIFF.
 
 Unfortunately, TUNSETIFF has always just ignored flags it doesn't
 understand, so there's no good way of detecting whether the kernel
 supports new IFF_ flags.
 
 This patch implements a TUNGETFEATURES ioctl which returns all the valid IFF
 flags.  It could be extended later to include other features.

snip

Looks good.

Dave, do you want me to put all outstanding TUN patches into a git tree so
that you can pull them in one shot ?
Otherwise if you're ok with applying them one by one please apply this one.

Acked-by: Max Krasnyansky [EMAIL PROTECTED]
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/4] tun: Allow GSO using virtio_net_hdr

2008-07-01 Thread Max Krasnyansky


Rusty Russell wrote:
 Add a IFF_VNET_HDR flag.  This uses the same ABI as virtio_net (ie. prepending
 struct virtio_net_hdr to packets) to indicate GSO and checksum information.
 
 Signed-off-by: Rusty Russell [EMAIL PROTECTED]
 ---
  drivers/net/tun.c  |   90 
 -
  include/linux/if_tun.h |2 +
  2 files changed, 91 insertions(+), 1 deletion(-)
 
 diff -r d94590c1550a drivers/net/tun.c
 --- a/drivers/net/tun.c   Thu Jun 26 00:21:11 2008 +1000
 +++ b/drivers/net/tun.c   Thu Jun 26 00:21:59 2008 +1000
 @@ -63,6 +63,7 @@
  #include linux/if_tun.h
  #include linux/crc32.h
  #include linux/nsproxy.h
 +#include linux/virtio_net.h
  #include net/net_namespace.h
  #include net/netns/generic.h
  
 @@ -283,12 +284,24 @@ static __inline__ ssize_t tun_get_user(s
   struct tun_pi pi = { 0, __constant_htons(ETH_P_IP) };
   struct sk_buff *skb;
   size_t len = count, align = 0;
 + struct virtio_net_hdr gso = { 0 };
  
   if (!(tun-flags  TUN_NO_PI)) {
   if ((len -= sizeof(pi))  count)
   return -EINVAL;
  
   if(memcpy_fromiovec((void *)pi, iv, sizeof(pi)))
 + return -EFAULT;
 + }
 +
 + if (tun-flags  TUN_VNET_HDR) {
 + if ((len -= sizeof(gso))  count)
 + return -EINVAL;
 +
 + if (gso.hdr_len  len)
 + return -EINVAL;
 +
 + if (memcpy_fromiovec((void *)gso, iv, sizeof(gso)))
   return -EFAULT;
   }

Unless I'm missing something the 'if (gso.hdr_len  len)' must be after
memcpy_fromiovec().

 @@ -322,8 +335,45 @@ static __inline__ ssize_t tun_get_user(s
   break;
   };
  
 - if (tun-flags  TUN_NOCHECKSUM)
 + if (gso.flags  VIRTIO_NET_HDR_F_NEEDS_CSUM) {
 + if (!skb_partial_csum_set(skb, gso.csum_start,
 +   gso.csum_offset)) {
 + tun-dev-stats.rx_dropped++;
 + kfree_skb(skb);
 + return -EINVAL;
 + }
 + } else if (tun-flags  TUN_NOCHECKSUM)
   skb-ip_summed = CHECKSUM_UNNECESSARY;
 +
 + if (gso.gso_type != VIRTIO_NET_HDR_GSO_NONE) {
 + pr_debug(GSO!\n);
 + switch (gso.gso_type  ~VIRTIO_NET_HDR_GSO_ECN) {
 + case VIRTIO_NET_HDR_GSO_TCPV4:
 + skb_shinfo(skb)-gso_type = SKB_GSO_TCPV4;
 + break;
 + case VIRTIO_NET_HDR_GSO_TCPV6:
 + skb_shinfo(skb)-gso_type = SKB_GSO_TCPV6;
 + break;
 + default:
 + tun-dev-stats.rx_dropped++;
 + kfree_skb(skb);
 + return -EINVAL;
 + }

We should use stats.rx_frame_errors instead of stats.rx_dropped to indicated
that we dropped it because something was wrong with the framing (headers,
etc). Applies to both of the cases above.

 +
 + if (gso.gso_type  VIRTIO_NET_HDR_GSO_ECN)
 + skb_shinfo(skb)-gso_type |= SKB_GSO_TCP_ECN;
 +
 + skb_shinfo(skb)-gso_size = gso.gso_size;
 + if (skb_shinfo(skb)-gso_size == 0) {
 + tun-dev-stats.rx_dropped++;
 + kfree_skb(skb);
 + return -EINVAL;
 + }
Same here.

Everything else looks good.

Max

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 4/4] lguest: Use GSO/IFF_VNET_HDR extensions on tun/tap

2008-07-01 Thread Max Krasnyansky


Rusty Russell wrote:
 On Thursday 26 June 2008 05:07:18 Anthony Liguori wrote:
 Rusty Russell wrote:
 @@ -1563,6 +1561,16 @@ static void setup_tun_net(char *arg)
 /* Tell Guest what MAC address to use. */
 add_feature(dev, VIRTIO_NET_F_MAC);
 add_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY);
 +   /* Expect Guest to handle everything except UFO */
 +   add_feature(dev, VIRTIO_NET_F_CSUM);
 You're setting this feature twice.
 
 Hmm, not in the version here?
 
 +   add_feature(dev, VIRTIO_NET_F_GUEST_CSUM);
 You set this feature, but I never see the virtio-net driver acknowledge
 the feature.  Curiously, my implementation with KVM is struggling
 because UDP packet checksums are not correct so the DHCP client is
 ignoring them.  If I disable CSUM offload, things it works fine (using
 the virtio-net header).  The problem is only host=guest, guest=host is
 fine.
 
 OK, found this: wrong args to skb_partial_csum_set.  It was found by Mark 
 McLoughlin before, I just lost the fix when I extracted this into a separate 
 patch.  I chose to move the call to skb_partial_csum_set(), rather than use 
 his fix (which assumed a tap not tun device).
 
 Here's two fixes on top of previous patch:
 
 diff -u b/drivers/net/tun.c b/drivers/net/tun.c
 --- b/drivers/net/tun.c   Thu Jun 26 00:21:59 2008 +1000
 +++ b/drivers/net/tun.c   Thu Jun 26 14:35:03 2008 +1000
 @@ -298,11 +298,11 @@
   if ((len -= sizeof(gso))  count)
   return -EINVAL;
  
 - if (gso.hdr_len  len)
 - return -EINVAL;
 -
   if (memcpy_fromiovec((void *)gso, iv, sizeof(gso)))
   return -EFAULT;
 +
 + if (gso.hdr_len  len)
 + return -EINVAL;
   }

Yep, looks better now.

  
   if ((tun-flags  TUN_TYPE_MASK) == TUN_TAP_DEV) {
 @@ -324,6 +324,16 @@
   return -EFAULT;
   }
  
 + if (gso.flags  VIRTIO_NET_HDR_F_NEEDS_CSUM) {
 + if (!skb_partial_csum_set(skb, gso.csum_start,
 +   gso.csum_offset)) {
 + tun-dev-stats.rx_dropped++;
 + kfree_skb(skb);
 + return -EINVAL;
 + }
 + } else if (tun-flags  TUN_NOCHECKSUM)
 + skb-ip_summed = CHECKSUM_UNNECESSARY;
 +
   switch (tun-flags  TUN_TYPE_MASK) {
   case TUN_TUN_DEV:
   skb_reset_mac_header(skb);
 @@ -335,16 +345,6 @@
   break;
   };
  
 - if (gso.flags  VIRTIO_NET_HDR_F_NEEDS_CSUM) {
 - if (!skb_partial_csum_set(skb, gso.csum_start,
 -   gso.csum_offset)) {
 - tun-dev-stats.rx_dropped++;
 - kfree_skb(skb);
 - return -EINVAL;
 - }
 - } else if (tun-flags  TUN_NOCHECKSUM)
 - skb-ip_summed = CHECKSUM_UNNECESSARY;
 -
   if (gso.gso_type != VIRTIO_NET_HDR_GSO_NONE) {
   pr_debug(GSO!\n);
   switch (gso.gso_type  ~VIRTIO_NET_HDR_GSO_ECN) {

Do you want to resent the GSO patch with all the latest fixes ? ie other
things (stat counters) I pointed out in the prev email.
I'll ack it.

Thanx
Max
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 3/5] tun: vringfd receive support.

2008-04-08 Thread Max Krasnyansky
Rusty Russell wrote:
 This patch modifies tun to allow a vringfd to specify the receive
 buffer.  Because we can't copy to userspace in bh context, we queue
 like normal then use the pull hook to actually do the copy.
 
 More thought needs to be put into the possible races with ring
 registration and a simultaneous close, for example (see FIXME).
 
 We use struct virtio_net_hdr prepended to packets in the ring to allow
 userspace to receive GSO packets in future (at the moment, the tun
 driver doesn't tell the stack it can handle them, so these cases are
 never taken).

In general the code looks good. The only thing I could not convince myself in
is whether having generic ring buffer makes sense or not.
At least the TUN driver would be more efficient if it had its own simple ring
 implementation. Less indirection, fewer callbacks, fewer if()s, etc. TUN
already has the file descriptor and having two additional fds for rx and tx
ring is a waste (think of a VPN server that has to have a bunch of TUN fds).
Also as I mentioned before Jamal and I wanted to expose some of the SKB fields
through TUN device. With the rx/tx rings the natural way of doing that would
be the ring descriptor itself. It can of course be done the same way we copy
proto info (PI) and GSO stuff before the packet but that means more
copy_to_user() calls and yet more checks.

So. What am I missing ? Why do we need generic ring for the TUN ? I looked at
the lguest code a bit and it seems that we need a bunch of network specific
code anyway. The cool thing is that you can now mmap the rings into the guest
 directly but the same thing can be done with TUN specific rings.

Max



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.

2008-03-03 Thread Max Krasnyansky
Rusty Russell wrote:
 On Friday 08 February 2008 16:39:03 Max Krasnyansky wrote:
 Rusty Russell wrote:
 (Changes since last time: we how have explicit IFF_RECV_CSUM and
 IFF_RECV_GSO bits, and some renaming of virtio_net hdr)

 We use the virtio_net_hdr: it is an ABI already and designed to
 encapsulate such metadata as GSO and partial checksums.

 IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr'
 at the start of each packet.  You can always write packets with
 partial checksum and gso to the tap device using this header.

 IFF_RECV_CSUM means you can handle reading packets with partial
 checksums.  If IFF_RECV_GSO is also set, it means you can handle
 reading (all types of) GSO packets.

 Note that there is no easy way to detect if these flags are supported:
 see next patch.
 Again sorry for delay in replying. Here are my thoughts on this.

 I like the approach in general. Certainly the part that creates skbs out of
 the user-space pages looks good. And it's fits nicely into existing TUN
 driver model. However I actually wanted to change the model :). In
 particular I'm talking about syscall per packet
 After messing around with things like libe1000.sf.net I'd like to make
 TUN/TAP driver look more like modern nic's to the user-space. In other
 words I'm thinking about introducing RX and TX rings that the user-space
 can then mmap() and write/read packets descriptors to/from. That will saves
 the number of system calls that the user-space app needs to do. That by
 itself saves a lot of overhead, combined with the GSO it's be lightning
 fast.
 
 The problem with this approach is that for what I'm doing, the packets aren't 
 nicely arranged somewhere; they're in random process memory.
That's fine. RX/TX descriptors would not contain the data itself. They'd
contain pointers to actual packets (ie just like the NIC takes physical memory
address and DMAs data in/out).
The allows for sending/receiving packets without syscalls and fits nicely with
the async schemes like GSO.

btw The code that I sent you does indeed expect packets to be in a mmap()ed
buffer but I agree that it only works for certain cases. In general it's not
flexible. I was thinking of introducing some flags in the descriptor that tell
the kernel how to handle the packet. ie Whether it needs to be just copied
into a fresh SKB or remapped with get_user_pages().

 I thought about further abusing writev and readv to do multiple packets at 
 once.  
I actually was going to abuse them from day one. At that time Alex Kuznetsov
told me that I'm crazy and I gave up on it :)

 Also btw why call it VIRTIO ? For example I'm actually interested in
 speeding up tunning and general network apps. We have wireless basestation
 apps here that need to handle packets in user-space. Those kind things have
 nothing to with virtualization.
 
 The structure is for virtio, I'm just borrowing it for tap because it's 
 already there.  We could rename it and move it out to its own header, but if 
 so we should do that before 2.6.25 is released.
If we do the whole enchilada with the RX/TX rings then we probably do not even
need it. I'm thinking that RX/TX descriptor would include everything you need
for the GSO and stuff.
I meant do not need it for the TUN/TAP driver that is. Is it used anywhere else 
?

Max



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/3] partial checksum and GSO support for tun/tap.

2008-02-08 Thread Max Krasnyansky
Rusty Russell wrote:
 (Changes since last time: we how have explicit IFF_RECV_CSUM and 
 IFF_RECV_GSO bits, and some renaming of virtio_net hdr)
 
 We use the virtio_net_hdr: it is an ABI already and designed to
 encapsulate such metadata as GSO and partial checksums.
 
 IFF_VIRTIO_HDR means you will write and read a 'struct virtio_net_hdr'
 at the start of each packet.  You can always write packets with
 partial checksum and gso to the tap device using this header.
 
 IFF_RECV_CSUM means you can handle reading packets with partial
 checksums.  If IFF_RECV_GSO is also set, it means you can handle
 reading (all types of) GSO packets.

 Note that there is no easy way to detect if these flags are supported:
 see next patch.

Again sorry for delay in replying. Here are my thoughts on this.

I like the approach in general. Certainly the part that creates skbs out of the 
user-space
pages looks good. And it's fits nicely into existing TUN driver model.
However I actually wanted to change the model :). In particular I'm talking 
about 
syscall per packet 
After messing around with things like libe1000.sf.net I'd like to make TUN/TAP 
driver look 
more like modern nic's to the user-space. In other words I'm thinking about 
introducing RX and
TX rings that the user-space can then mmap() and write/read packets descriptors 
to/from.
That will saves the number of system calls that the user-space app needs to do. 
That by 
itself saves a lot of overhead, combined with the GSO it's be lightning fast.

I'm going to send you a version that I cooked up awhile ago in a private email. 
Do not want
to spam netdev :). It's not quite the RX/TX ring model but I'll give you an 
idea.
I did some profiling and PPS (packets per second) numbers that user-space can 
handle literally 
sky rocketed.

btw We had a long discussion with Eugeniy Polakov on mapping user-pages vs 
mmap()ing large
kernel buffer and doing normal memcpy() (ie instead of copy_to/fromuser()) in 
the kernel.
On small packets overhead of get_user_pages() eats up all the benefits. So we 
should think
of some scheme that nicely combines the two. Kind of like copy break that 
latest net 
drivers do these days.

Also btw why call it VIRTIO ? For example I'm actually interested in speeding 
up tunning
and general network apps. We have wireless basestation apps here that need to 
handle packets
in user-space. Those kind things have nothing to with virtualization.

Max
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization