Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Thu, Aug 20, 2009 at 04:38:17PM +0300, Michael S. Tsirkin wrote:
On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
On Thursday 20 August 2009, Michael S. Tsirkin wrote:
On Wed, Aug 19, 2009 at 05:27:07PM +0200, Arnd Bergmann wrote:
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:

Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET, VHOST_SET_OWNER, VHOST_RESET_OWNER

SET/RESET OWNER is still needed: otherwise if you share a descriptor with another process, it can corrupt your memory.

How? The point of using user threads is that you only ever access the address space of the thread that called the ioctl.

Think about this example with processes A and B sharing an fd:
A does SET_USED_ADDRESS
B does SET_USED_ADDRESS
A does VHOST_NET_SPLICE
See how stuff gets written into a random place in memory of A?

Yes, I didn't think of that. It doesn't seem like a big problem though, because it's a clear misuse of the API (I guess your current code returns an error for one of the SET_USED_ADDRESS ioctls), so I would see it as a classic garbage-in garbage-out case. It may even work in the case that the sharing of the fd resulted from a fork, where the address contains the same buffer in both processes. I can't think of a reason why you would want to use it like that though.

It doesn't matter that I don't want this: allowing one process to corrupt another's memory is a security issue. Once you get an fd, you want to be able to use it without worrying that a bug in another process will crash yours.

If B's SET_USED_ADDRESS fails, how can one process corrupt another process's memory?

--
Gleb.
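[Editor's note: the exchange above is why the patch keeps an explicit VHOST_SET_OWNER step. As a rough illustration of the kind of check being discussed (the structure, field and function names below are assumptions, not necessarily what the posted patch uses), the device can remember the mm of the process that claimed ownership and refuse address-changing ioctls from anyone else, so B's SET_USED_ADDRESS fails instead of redirecting writes into A's memory:]

    #include <linux/errno.h>
    #include <linux/sched.h>

    /* Illustrative only: owner tracking for a shared vhost fd. */
    struct vhost_dev {
            struct mm_struct *mm;   /* set by VHOST_SET_OWNER, NULL until then */
            /* ... */
    };

    static long vhost_dev_set_owner(struct vhost_dev *d)
    {
            if (d->mm)
                    return -EBUSY;          /* somebody already owns this fd */
            d->mm = get_task_mm(current);   /* pin the caller's address space */
            return 0;
    }

    static long vhost_dev_check_owner(struct vhost_dev *d)
    {
            /* called before e.g. SET_USED_ADDRESS: only the owner may
             * change where the device writes into user memory */
            return d->mm == current->mm ? 0 : -EPERM;
    }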
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Thursday 20 August 2009, Michael S. Tsirkin wrote:
On Wed, Aug 19, 2009 at 05:27:07PM +0200, Arnd Bergmann wrote:
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:

Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET, VHOST_SET_OWNER, VHOST_RESET_OWNER

SET/RESET OWNER is still needed: otherwise if you share a descriptor with another process, it can corrupt your memory.

How? The point of using user threads is that you only ever access the address space of the thread that called the ioctl.

Think about this example with processes A and B sharing an fd:
A does SET_USED_ADDRESS
B does SET_USED_ADDRESS
A does VHOST_NET_SPLICE
See how stuff gets written into a random place in memory of A?

Yes, I didn't think of that. It doesn't seem like a big problem though, because it's a clear misuse of the API (I guess your current code returns an error for one of the SET_USED_ADDRESS ioctls), so I would see it as a classic garbage-in garbage-out case. It may even work in the case that the sharing of the fd resulted from a fork, where the address contains the same buffer in both processes. I can't think of a reason why you would want to use it like that though.

Why would I wake up the threads spuriously? Do you mean for stopping the transmission or something else? I guess a pthread_kill would be enough for shutting it down.

If you kill and restart them you lose the priority and other parameters, but maybe that is acceptable.

If you want to restart it, just send a nonfatal signal (SIGUSR1, SIGRTMIN, ...) instead of a SIGKILL.

More importantly, you lose control of CPU locality. Simply put, a natural threading model in virtualization is one thread per guest vcpu. Asking applications to add multiple helper threads just so they can block forever is wrong, IMO, as userspace has no idea which CPU they should be on, what priority to use, etc.

But the kernel also doesn't know this, you get the same problem in another way. If you have multiple guests running at different priorities, the kernel will use those priorities to do the more important transfers first, while with a global workqueue every guest gets the same priority.

We could create more threads if this becomes a problem. I just think it should be transparent to userspace. Possibly it's useful to look at the packet header as well to decide on priority: this is something userspace can't do.

Being transparent to user space would be nice, I agree. Letting user space choose would also be nice, e.g. if you want to distribute eight available hardware queue pairs to three guests in a non-obvious way. The implementation depends to some degree on how we want to do multiqueue RX/TX in virtio-net in the long run. For best cache locality and NUMA behaviour, we might want to have one virtqueue per guest CPU and control them independently from the host. Priorities of the packets are dealt with in the packet scheduler for external interfaces, I guess that is sufficient. I'm not sure if we need to honor the same priorities for guest-to-guest communication, my feeling is that we don't need to.

You say that the natural model is to have one thread per guest CPU,

Sorry, I was not clear. I think userspace should create a thread per guest. We can create as many as we need for networking, but I think this should be behind the scenes, so userspace shouldn't bother with host CPUs, it will just get it wrong. Think of CPU hotplug, interrupts migrating between CPUs, etc ...

Yes, I hope we can avoid letting qemu know about host CPUs.

I'm not sure we can avoid it completely, because something needs to set physical IRQ affinity and such for the virtual devices if you want to get the best locality.

but you have a thread per host CPU instead. If the numbers are different, you probably lose either way.

The trick I used is to keep things as local as possible: TX is done on the CPU that runs the guest, RX on the CPU that runs the NIC interrupt. A smart SMP guest sees which CPU gets interrupts from the NIC and schedules RX there, so it shouldn't matter if the numbers of CPUs are different.

Yes, that sounds good.

It gets worse if you try to apply NUMA policies.

I believe the best we could do is avoid switching CPUs until we know the actual destination.

My point is that the RX data in the guest address space should be on the same NUMA node that gets the interrupt.

And note that there might be more than one error. I guess that's another problem with trying to layer on top of vfs.

Why is that different from any other system call?

With other system calls nothing happens while you process the error. Here, the guest (other queues) and the network keep running (unless there is a thread per queue; maybe we can block a queue, but we both agreed above that we don't want that).
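[Editor's note: side note on "something needs to set physical IRQ affinity" -- on the host this is plain /proc/irq manipulation, so whichever component ends up knowing the right CPU can do it. A generic sketch, where the IRQ number and CPU mask are made-up examples and nothing vhost-specific is assumed:]

    #include <stdio.h>

    /* Write a CPU bitmask into /proc/irq/<irq>/smp_affinity so the NIC
     * queue's interrupt (and hence RX processing) stays on a chosen host
     * CPU.  irq=59 and mask=0x2 (CPU 1) are arbitrary example values. */
    static int set_irq_affinity(int irq, unsigned long mask)
    {
            char path[64];
            FILE *f;

            snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%lx\n", mask);
            return fclose(f);
    }

    int main(void)
    {
            return set_irq_affinity(59, 0x2) ? 1 : 0;
    }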
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Thursday 20 August 2009, Michael S. Tsirkin wrote:
On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
On Thursday 20 August 2009, Michael S. Tsirkin wrote:

It doesn't matter that I don't want this: allowing one process to corrupt another's memory is a security issue. Once you get an fd, you want to be able to use it without worrying that a bug in another process will crash yours.

Ok, got it. Yes, that would be unacceptable.

If you assume losing the code for the second error condition is OK, why is the first one so important?

That's why I used a counter (eventfd) per virtqueue: on error, userspace can scan the ring and poll the socket and discover what's wrong, and the counter ensures we can detect that an error happened while we were not looking.

I guess we were talking about different kinds of errors here, and I'm still not sure which one you are talking about.

Non-fatal errors. E.g. translation errors probably should be non-fatal. I can also imagine working around guest bugs in userspace.

Ah, so I guess the confusion was that I was worried about errors coming from the socket file descriptor, while you were thinking of errors from the guest side, which I did not expect to happen. The errors from the socket (or chardev, as that was the start of the argument) should still fit into the categories that I mentioned: either they can be handled by the host kernel, or they are fatal.

I'll read up in your code to see how you handle asynchronous non-fatal errors from the guest. Intuitively, I'd still assume that returning the first error should be enough, because it will typically mean that you cannot continue without fixing it up first, and you might get the next error immediately after that.

	Arnd
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Thu, Aug 20, 2009 at 04:31:36PM +0200, Arnd Bergmann wrote:
On Thursday 20 August 2009, Michael S. Tsirkin wrote:
On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
On Thursday 20 August 2009, Michael S. Tsirkin wrote:

It doesn't matter that I don't want this: allowing one process to corrupt another's memory is a security issue. Once you get an fd, you want to be able to use it without worrying that a bug in another process will crash yours.

Ok, got it. Yes, that would be unacceptable.

If you assume losing the code for the second error condition is OK, why is the first one so important?

That's why I used a counter (eventfd) per virtqueue: on error, userspace can scan the ring and poll the socket and discover what's wrong, and the counter ensures we can detect that an error happened while we were not looking.

I guess we were talking about different kinds of errors here, and I'm still not sure which one you are talking about.

Non-fatal errors. E.g. translation errors probably should be non-fatal. I can also imagine working around guest bugs in userspace.

Ah, so I guess the confusion was that I was worried about errors coming from the socket file descriptor, while you were thinking of errors from the guest side, which I did not expect to happen. The errors from the socket (or chardev, as that was the start of the argument) should still fit into the categories that I mentioned: either they can be handled by the host kernel, or they are fatal.

Hmm, are you sure? Imagine a device going away while the socket is bound to it. You get -ENXIO. It's not fatal in the sense that you can bind the socket to another device and go on, right?

I'll read up in your code to see how you handle asynchronous non-fatal errors from the guest. Intuitively, I'd still assume that returning the first error should be enough, because it will typically mean that you cannot continue without fixing it up first, and you might get the next error immediately after that.

	Arnd

Yes, but we need to be careful not to lose that next error, and these errors only block one queue. I handle this by reporting them on an eventfd, per vq, which userspace can poll.

-- 
MST
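[Editor's note: the per-virtqueue error eventfd described above is easy to consume from userspace: reading the eventfd returns and clears the accumulated counter, so an error signalled "while we were not looking" is not lost. A minimal sketch, assuming userspace already holds the eventfd for one virtqueue (how it gets registered with the vhost device is not shown here):]

    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Wait for the per-virtqueue error eventfd to fire, then let the
     * caller rescan the ring and poll the socket.  The read() resets the
     * counter; a value > 1 means several errors were signalled since the
     * last look. */
    static int wait_for_vq_error(int err_eventfd)
    {
            struct pollfd pfd = { .fd = err_eventfd, .events = POLLIN };
            uint64_t count;

            if (poll(&pfd, 1, -1) < 0)
                    return -1;
            if (read(err_eventfd, &count, sizeof(count)) != sizeof(count))
                    return -1;
            printf("virtqueue signalled %llu error(s), rescanning ring\n",
                   (unsigned long long)count);
            return 0;
    }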
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Thursday 20 August 2009, Michael S. Tsirkin wrote:

The errors from the socket (or chardev, as that was the start of the argument) should still fit into the categories that I mentioned: either they can be handled by the host kernel, or they are fatal.

Hmm, are you sure? Imagine a device going away while the socket is bound to it. You get -ENXIO. It's not fatal in the sense that you can bind the socket to another device and go on, right?

Right. Not fatal in that sense, but fatal in the sense that I can no longer transmit other frames until you recover. I think we both meant the same here.

	Arnd
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Sunday 16 August 2009, Michael S. Tsirkin wrote:
On Fri, Aug 14, 2009 at 01:40:36PM +0200, Arnd Bergmann wrote:

* most of the transports are sockets, tap uses a character device. This could be dealt with by having both a struct socket * in struct vhost_net *and* a struct file *, or by always keeping the struct file and calling vfs_readv/vfs_writev for the data transport in both cases.

I am concerned that character devices might have weird side effects with read/write operations and that calling them from a kernel thread the way I do might have security implications. Can't point at anything specific though at the moment.

I understand your feelings about passing a chardev fd into your driver and I agree that we need to be very careful if we want to allow it. Maybe we could instead extend the 'splice' system call to work on a vhost_net file descriptor. If we do that, we can put the access back into a user thread (or two) that stays in splice indefinitely to avoid some of the implications of kernel threads like the missing ability to handle transfer errors in user space.

I wonder - can we expose the underlying socket used by tap, or will that create complex lifetime issues?

I think this could get more messy in the long run than calling vfs_readv on a random fd. It would mean deep internal knowledge of the tap driver in vhost_net, which I really would prefer to avoid.

* Each transport has a slightly different header, we have
- raw ethernet frames (raw, udp multicast, tap)
- 32-bit length + raw frames, possibly fragmented (tcp)
- 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)

To handle these three cases, we need either different ioctl numbers so that vhost_net can choose the right one, or a flags field in VHOST_NET_SET_SOCKET, like

#define VHOST_NET_RAW      1
#define VHOST_NET_LEN_HDR  2
#define VHOST_NET_VNET_HDR 4

struct vhost_net_socket {
	unsigned int flags;
	int fd;
};
#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)

It seems we can query the socket to find out the type,

yes, I understand that you can do that, but I still think that decision should be left to user space. Adding a length header for TCP streams but not for UDP is something that we would normally want to do, but IMHO vhost_net should not need to know about this.

or use the features ioctl.

Right, I had forgotten about that one. It's probably equivalent to the flags I suggested, except that one allows you to set features after starting the communication, while the other one prevents you from doing that.

Qemu could then automatically try to use vhost_net, if it's available in the kernel, or just fall back on software vlan otherwise. Does that make sense?

I agree, long term it should be enabled automatically when possible.

So how about making the qemu command line interface an extension to what Or Gerlitz has done for the raw packet sockets?

	Arnd
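[Editor's note: for reference, the "80-bit header" in the tap-with-vnet_hdr case is the standard 10-byte virtio_net_hdr carrying checksum-offload and GSO metadata in front of each frame; the flag values shown with it below are only the ones proposed in this thread, not a merged ABI:]

    #include <linux/types.h>

    /* 10-byte (80-bit) header prepended by tap when IFF_VNET_HDR is on. */
    struct virtio_net_hdr {
            __u8  flags;
            __u8  gso_type;
            __u16 hdr_len;      /* length of the Ethernet/IP/TCP headers */
            __u16 gso_size;     /* GSO segment size */
            __u16 csum_start;   /* where checksumming starts */
            __u16 csum_offset;  /* where the checksum is stored */
    };

    /* Flags proposed (not merged) in this thread for VHOST_NET_SET_SOCKET:
     * raw frames, 32-bit length-prefixed frames (TCP), or vnet_hdr frames. */
    #define VHOST_NET_RAW      1
    #define VHOST_NET_LEN_HDR  2
    #define VHOST_NET_VNET_HDR 4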
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Wed, Aug 19, 2009 at 11:04:50AM +0200, Arnd Bergmann wrote:
On Sunday 16 August 2009, Michael S. Tsirkin wrote:
On Fri, Aug 14, 2009 at 01:40:36PM +0200, Arnd Bergmann wrote:

* most of the transports are sockets, tap uses a character device. This could be dealt with by having both a struct socket * in struct vhost_net *and* a struct file *, or by always keeping the struct file and calling vfs_readv/vfs_writev for the data transport in both cases.

I am concerned that character devices might have weird side effects with read/write operations and that calling them from a kernel thread the way I do might have security implications. Can't point at anything specific though at the moment.

I understand your feelings about passing a chardev fd into your driver and I agree that we need to be very careful if we want to allow it. Maybe we could instead extend the 'splice' system call to work on a vhost_net file descriptor. If we do that, we can put the access back into a user thread (or two) that stays in splice indefinitely

An issue with exposing the internal threading model to userspace in this way is that we lose control of e.g. CPU locality - and it is very hard for userspace to get it right.

to avoid some of the implications of kernel threads like the missing ability to handle transfer errors in user space.

Are you talking about TCP here? Transfer errors are typically asynchronous - possibly the eventfd I expose for vhost net is sufficient there.

I wonder - can we expose the underlying socket used by tap, or will that create complex lifetime issues?

I think this could get more messy in the long run than calling vfs_readv on a random fd. It would mean deep internal knowledge of the tap driver in vhost_net, which I really would prefer to avoid.

No, what I had in mind is adding a GET_SOCKET ioctl to tap. vhost would then just use the socket.

* Each transport has a slightly different header, we have
- raw ethernet frames (raw, udp multicast, tap)
- 32-bit length + raw frames, possibly fragmented (tcp)
- 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)

To handle these three cases, we need either different ioctl numbers so that vhost_net can choose the right one, or a flags field in VHOST_NET_SET_SOCKET, like

#define VHOST_NET_RAW      1
#define VHOST_NET_LEN_HDR  2
#define VHOST_NET_VNET_HDR 4

struct vhost_net_socket {
	unsigned int flags;
	int fd;
};
#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)

It seems we can query the socket to find out the type,

yes, I understand that you can do that, but I still think that decision should be left to user space. Adding a length header for TCP streams but not for UDP is something that we would normally want to do, but IMHO vhost_net should not need to know about this.

or use the features ioctl.

Right, I had forgotten about that one. It's probably equivalent to the flags I suggested, except that one allows you to set features after starting the communication, while the other one prevents you from doing that.

Qemu could then automatically try to use vhost_net, if it's available in the kernel, or just fall back on software vlan otherwise. Does that make sense?

I agree, long term it should be enabled automatically when possible.

So how about making the qemu command line interface an extension to what Or Gerlitz has done for the raw packet sockets?

	Arnd

Not sure I see the connection, but I have not thought about the qemu side of things too much yet - trying to get the kernel bits in place first so that there's a stable ABI to work with.

-- 
MST
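[Editor's note: a rough sketch of what the GET_SOCKET idea could look like on the kernel side of the tun/tap driver; the helper name and the tun internals shown here are assumptions for illustration, not code from the patch or the driver:]

    /* Hypothetical hook in the tun/tap driver: given the struct file
     * behind a tap fd, return the socket that backs it so that vhost_net
     * can use sendmsg/recvmsg on it directly, without knowing anything
     * else about tap internals. */
    struct socket *tun_get_socket(struct file *file)
    {
            struct tun_struct *tun;

            if (file->f_op != &tun_fops)
                    return ERR_PTR(-EINVAL);    /* not a tap fd at all */
            tun = file->private_data;
            if (!tun)
                    return ERR_PTR(-EBADFD);    /* tap fd not attached yet */
            return &tun->socket;                /* assumed embedded socket */
    }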
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:

Maybe we could instead extend the 'splice' system call to work on a vhost_net file descriptor. If we do that, we can put the access back into a user thread (or two) that stays in splice indefinitely

An issue with exposing the internal threading model to userspace in this way is that we lose control of e.g. CPU locality - and it is very hard for userspace to get it right.

Good point, I hadn't thought about that in this context. For macvtap, my idea was to open the same tap char device multiple times and use each fd exclusively on one *guest* CPU. I'm not sure if virtio-net can already handle SMP guests efficiently. We might actually need to extend it to have more pairs of virtqueues, one for each guest CPU, which can then be bound to a host queue (or queue pair) in the physical NIC.

Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET, VHOST_SET_OWNER, VHOST_RESET_OWNER and your kernel thread with a new VHOST_NET_SPLICE blocking ioctl that does all the transfers in the context of the calling thread. This would improve the driver on various fronts:
- no need for playing tricks with use_mm/unuse_mm
- possibly fewer global TLB flushes from switch_mm, which may improve performance
- ability to pass down error codes from socket or guest to user space by returning from ioctl
- based on that, the ability to use any kind of file descriptor that can do writev/readv or sendmsg/recvmsg without the nastiness you mentioned

The disadvantage of course is that you need to add a user thread for each guest device to make up for the workqueue that you save.

to avoid some of the implications of kernel threads like the missing ability to handle transfer errors in user space.

Are you talking about TCP here? Transfer errors are typically asynchronous - possibly the eventfd I expose for vhost net is sufficient there.

I mean errors in general if we allow random file descriptors to be used. E.g. tun_chr_aio_read could return EBADFD, EINVAL, EFAULT, ERESTARTSYS, EIO, EAGAIN and possibly others. We can handle some in the kernel, others should never happen with vhost_net, but if something unexpected happens it would be nice to just bail out to user space.

I wonder - can we expose the underlying socket used by tap, or will that create complex lifetime issues?

I think this could get more messy in the long run than calling vfs_readv on a random fd. It would mean deep internal knowledge of the tap driver in vhost_net, which I really would prefer to avoid.

No, what I had in mind is adding a GET_SOCKET ioctl to tap. vhost would then just use the socket.

Right, that would work with tun/tap at least. It sounds a bit fishy but I can't see a reason why it would be hard to do. I'd have to think about how to get it working with macvtap, or if there is much value left in macvtap after that anyway.

So how about making the qemu command line interface an extension to what Or Gerlitz has done for the raw packet sockets?

Not sure I see the connection, but I have not thought about the qemu side of things too much yet - trying to get the kernel bits in place first so that there's a stable ABI to work with.

Ok, fair enough. The kernel bits are obviously more time critical right now, since they should get into 2.6.32.

	Arnd
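[Editor's note: to make the VHOST_NET_SPLICE proposal concrete, the qemu side would boil down to a helper thread that blocks in the ioctl and sees socket or guest errors as ordinary return values. A sketch under the assumption that such an ioctl existed -- the request number is invented here, since the proposal was never merged:]

    #include <errno.h>
    #include <linux/ioctl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>

    /* VHOST_NET_SPLICE as proposed in this thread; the number is a
     * placeholder, no such ioctl exists in the merged driver. */
    #define VHOST_NET_SPLICE  _IO(0xAF, 0x31)

    /* One helper thread per guest device: block in the kernel moving data
     * between virtqueues and the backend, surface errors to userspace. */
    static void *vhost_splice_thread(void *arg)
    {
            int vhost_fd = *(int *)arg;

            for (;;) {
                    if (ioctl(vhost_fd, VHOST_NET_SPLICE) == 0)
                            break;              /* assumed: clean shutdown */
                    if (errno == EINTR)
                            continue;           /* nonfatal signal, re-enter */
                    fprintf(stderr, "vhost splice error: %d\n", errno);
                    break;                      /* let qemu decide how to recover */
            }
            return NULL;
    }

[qemu would start one such thread per device (or per direction) with pthread_create() and interrupt it with a nonfatal signal, along the lines discussed above.]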
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:

Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET, VHOST_SET_OWNER, VHOST_RESET_OWNER

SET/RESET OWNER is still needed: otherwise if you share a descriptor with another process, it can corrupt your memory.

How? The point of using user threads is that you only ever access the address space of the thread that called the ioctl.

and your kernel thread with a new VHOST_NET_SPLICE blocking ioctl that does all the transfers in the context of the calling thread.

For one, you'd want a thread per virtqueue.

Don't understand why. The thread can be blocked on all four ends of the device and wake up whenever there is some work to do in any direction. If we have data to be transferred in both ways, we save one thread switch. It probably won't hurt much to have one thread per direction, but I don't see the point.

Second, incoming traffic might arrive on another CPU, we want to keep it local. I guess you would also want ioctls to wake up the threads spuriously ...

Outbound traffic should just stay on whatever CPU was sending it from the guest. For inbound traffic, we should only wake up the thread on the CPU that got the data to start with. Why would I wake up the threads spuriously? Do you mean for stopping the transmission or something else? I guess a pthread_kill would be enough for shutting it down.

This would improve the driver on various fronts:
- no need for playing tricks with use_mm/unuse_mm
- possibly fewer global TLB flushes from switch_mm, which may improve performance

Why would there be fewer flushes?

I just read up on task->active_mm handling. There probably wouldn't be any. I got that wrong.

- based on that, the ability to use any kind of file descriptor that can do writev/readv or sendmsg/recvmsg without the nastiness you mentioned

Yes, it's an interesting approach. As I said, we need to tread very carefully though, I don't think all issues are figured out. For example: what happens if we pass our own fd here? Will the refcount on the file ever get to 0 on exit? There may be others ...

right.

The disadvantage of course is that you need to add a user thread for each guest device to make up for the workqueue that you save.

More importantly, you lose control of CPU locality. Simply put, a natural threading model in virtualization is one thread per guest vcpu. Asking applications to add multiple helper threads just so they can block forever is wrong, IMO, as userspace has no idea which CPU they should be on, what priority to use, etc.

But the kernel also doesn't know this, you get the same problem in another way. If you have multiple guests running at different priorities, the kernel will use those priorities to do the more important transfers first, while with a global workqueue every guest gets the same priority. You say that the natural model is to have one thread per guest CPU, but you have a thread per host CPU instead. If the numbers are different, you probably lose either way. It gets worse if you try to apply NUMA policies.

to avoid some of the implications of kernel threads like the missing ability to handle transfer errors in user space.

Are you talking about TCP here? Transfer errors are typically asynchronous - possibly the eventfd I expose for vhost net is sufficient there.

I mean errors in general if we allow random file descriptors to be used. E.g. tun_chr_aio_read could return EBADFD, EINVAL, EFAULT, ERESTARTSYS, EIO, EAGAIN and possibly others. We can handle some in the kernel, others should never happen with vhost_net, but if something unexpected happens it would be nice to just bail out to user space.

And note that there might be more than one error. I guess that's another problem with trying to layer on top of vfs.

Why is that different from any other system call? We just return when we hit the first error condition.

	Arnd
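[Editor's note: for context on the "tricks with use_mm/unuse_mm" that the splice approach would avoid -- a kernel thread has no user address space of its own, so the workqueue-based driver has to adopt the owner's mm before it can copy to or from the guest-visible rings. Roughly, and illustrative only (the surrounding code is not from the patch, and the header providing use_mm() varied across kernel versions around this time):]

    #include <linux/mmu_context.h>  /* use_mm() / unuse_mm() */
    #include <linux/sched.h>

    /* Worker running in a kernel thread: borrow the owner process's
     * address space so copy_to_user()/copy_from_user() on the ring
     * addresses work, then drop it again before sleeping. */
    static void handle_ring_from_kthread(struct mm_struct *owner_mm)
    {
            use_mm(owner_mm);

            /* ... copy_from_user() avail/descriptor entries,
             *     copy_to_user() used entries ... */

            unuse_mm(owner_mm);
    }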
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Fri, Aug 14, 2009 at 01:40:36PM +0200, Arnd Bergmann wrote:
On Thursday 13 August 2009, Michael S. Tsirkin wrote:

What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification.

AFAICT, you have addressed all my comments, mostly by convincing me that you got it right anyway ;-). I hope this gets into 2.6.32, good work!

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Arnd Bergmann a...@arndb.de

One idea though:

+	/* Parameter checking */
+	if (sock->sk->sk_type != SOCK_RAW) {
+		r = -ESOCKTNOSUPPORT;
+		goto done;
+	}
+
+	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
+			       &uaddr_len, 0);
+	if (r)
+		goto done;
+
+	if (uaddr.sa.sll_family != AF_PACKET) {
+		r = -EPFNOSUPPORT;
+		goto done;
+	}

You currently limit the scope of the driver by only allowing raw packet sockets to be passed into the network driver. In qemu, we currently support some very similar transports:

* raw packet (not in a release yet)
* tcp connection
* UDP multicast
* tap character device
* VDE with Unix local sockets

My primary interest right now is the tap support, but I think it would be interesting in general to allow different file descriptor types in vhost_net_set_socket. AFAICT, there are two major differences that we need to handle for this:

* most of the transports are sockets, tap uses a character device. This could be dealt with by having both a struct socket * in struct vhost_net *and* a struct file *, or by always keeping the struct file and calling vfs_readv/vfs_writev for the data transport in both cases.

I am concerned that character devices might have weird side effects with read/write operations and that calling them from a kernel thread the way I do might have security implications. Can't point at anything specific though at the moment.

I wonder - can we expose the underlying socket used by tap, or will that create complex lifetime issues?

* Each transport has a slightly different header, we have
- raw ethernet frames (raw, udp multicast, tap)
- 32-bit length + raw frames, possibly fragmented (tcp)
- 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)

To handle these three cases, we need either different ioctl numbers so that vhost_net can choose the right one, or a flags field in VHOST_NET_SET_SOCKET, like

#define VHOST_NET_RAW      1
#define VHOST_NET_LEN_HDR  2
#define VHOST_NET_VNET_HDR 4

struct vhost_net_socket {
	unsigned int flags;
	int fd;
};
#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)

It seems we can query the socket to find out the type, or use the features ioctl.

If both of those are addressed, we can treat vhost_net as a generic way to do network handling in the kernel independent of the qemu model (raw, tap, ...) for it. Your qemu patch would have to work differently, so instead of

	qemu -net nic,vhost=eth0

you would do the same as today with the raw packet socket extension

	qemu -net nic -net raw,ifname=eth0

Qemu could then automatically try to use vhost_net, if it's available in the kernel, or just fall back on software vlan otherwise. Does that make sense?

	Arnd

I agree, long term it should be enabled automatically when possible.

-- 
MST
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote:

What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification.

AFAICT, you have addressed all my comments, mostly by convincing me that you got it right anyway ;-). I hope this gets into 2.6.32, good work!

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Arnd Bergmann a...@arndb.de

One idea though:

+	/* Parameter checking */
+	if (sock->sk->sk_type != SOCK_RAW) {
+		r = -ESOCKTNOSUPPORT;
+		goto done;
+	}
+
+	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
+			       &uaddr_len, 0);
+	if (r)
+		goto done;
+
+	if (uaddr.sa.sll_family != AF_PACKET) {
+		r = -EPFNOSUPPORT;
+		goto done;
+	}

You currently limit the scope of the driver by only allowing raw packet sockets to be passed into the network driver. In qemu, we currently support some very similar transports:

* raw packet (not in a release yet)
* tcp connection
* UDP multicast
* tap character device
* VDE with Unix local sockets

My primary interest right now is the tap support, but I think it would be interesting in general to allow different file descriptor types in vhost_net_set_socket. AFAICT, there are two major differences that we need to handle for this:

* most of the transports are sockets, tap uses a character device. This could be dealt with by having both a struct socket * in struct vhost_net *and* a struct file *, or by always keeping the struct file and calling vfs_readv/vfs_writev for the data transport in both cases.

* Each transport has a slightly different header, we have
- raw ethernet frames (raw, udp multicast, tap)
- 32-bit length + raw frames, possibly fragmented (tcp)
- 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)

To handle these three cases, we need either different ioctl numbers so that vhost_net can choose the right one, or a flags field in VHOST_NET_SET_SOCKET, like

#define VHOST_NET_RAW      1
#define VHOST_NET_LEN_HDR  2
#define VHOST_NET_VNET_HDR 4

struct vhost_net_socket {
	unsigned int flags;
	int fd;
};
#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)

If both of those are addressed, we can treat vhost_net as a generic way to do network handling in the kernel independent of the qemu model (raw, tap, ...) for it. Your qemu patch would have to work differently, so instead of

	qemu -net nic,vhost=eth0

you would do the same as today with the raw packet socket extension

	qemu -net nic -net raw,ifname=eth0

Qemu could then automatically try to use vhost_net, if it's available in the kernel, or just fall back on software vlan otherwise. Does that make sense?

	Arnd
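[Editor's note: since the posted patch only accepts bound raw packet sockets, the userspace side of wiring it up amounts to creating such a socket and handing its fd to the VHOST_NET_SET_SOCKET ioctl. A minimal sketch; "eth0" is just an example interface, and the ioctl call itself is omitted because its final encoding was still under discussion in this thread:]

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <net/if.h>
    #include <netpacket/packet.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Open a raw AF_PACKET socket bound to a host NIC.  The returned fd
     * passes the SOCK_RAW / AF_PACKET checks quoted in the patch above
     * and is what the VHOST_NET_SET_SOCKET ioctl would receive. */
    static int open_raw_backend(const char *ifname)
    {
            struct sockaddr_ll sll;
            int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

            if (fd < 0)
                    return -1;
            memset(&sll, 0, sizeof(sll));
            sll.sll_family   = AF_PACKET;
            sll.sll_protocol = htons(ETH_P_ALL);
            sll.sll_ifindex  = if_nametoindex(ifname);
            if (!sll.sll_ifindex ||
                bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
                    close(fd);
                    return -1;
            }
            return fd;
    }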