Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Thu, 2007-04-12 at 06:32 +0300, Avi Kivity wrote:
> I hadn't considered an always-blocking (or unbuffered) networking API. It's very counter to current APIs, but does make sense with things like syslets. Without syslets, I don't think it's very useful as you need some artificial threads to keep things humming along.
>
> (How would userspace specify it? O_DIRECT when opening the tap?)

TBH, I hadn't thought that far. Tap already has those IFF_NO_PI etc flags, but it might make sense to just be the default. From userspace's POV it's not a semantic change.

OK, just tested: I can get 230,000 packets (28 byte UDP) through the tun device in a second (130,000 actually out the 100-base-T NIC, 100,000 dropped). If the tun driver's write() blocks until the skb is destroyed, it's 4,000 packets.

So your intuition was right: skb_free latency on xmit (at least for this e1000) is far too large for anything but an async solution.

Will ponder further.

Thanks!
Rusty.

- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
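The figures in the message above can be turned into per-packet time budgets. The following is only a back-of-the-envelope sketch: it assumes that in the blocking case exactly one packet is outstanding at a time, so the rate directly implies an average wait per write(). The variable names are illustrative and do not come from the thread.

```python
# Figures quoted in the message above.
ASYNC_PPS = 230_000     # packets/s accepted by tun when write() returns immediately
WIRE_PPS = 130_000      # packets/s that actually left the NIC
BLOCKING_PPS = 4_000    # packets/s when write() waits for the skb to be destroyed


def per_packet_us(pps):
    """Average time available per packet, in microseconds."""
    return 1_000_000 / pps


# Packets accepted by tun but never transmitted.
dropped = ASYNC_PPS - WIRE_PPS

# Assumption: one packet in flight at a time, so each blocking write()
# waits this long on average for the skb to be freed.
blocking_latency_us = per_packet_us(BLOCKING_PPS)

# Average spacing between packets at the rate the NIC actually sustained.
wire_interval_us = per_packet_us(WIRE_PPS)

# How much slower the blocking path is than what the wire sustained.
slowdown = WIRE_PPS / BLOCKING_PPS

print(f"dropped per second:        {dropped}")
print(f"blocking wait per packet:  {blocking_latency_us:.0f} us")
print(f"wire interval per packet:  {wire_interval_us:.2f} us")
print(f"blocking slowdown:         {slowdown:.1f}x")
```

Under that assumption the skb free latency is roughly 250 us per packet, against a wire interval of under 8 us, which is the gap that motivates an asynchronous interface.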
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Rusty Russell wrote:
> On Thu, 2007-04-12 at 06:32 +0300, Avi Kivity wrote:
>> I hadn't considered an always-blocking (or unbuffered) networking API. It's very counter to current APIs, but does make sense with things like syslets. Without syslets, I don't think it's very useful as you need some artificial threads to keep things humming along.
>>
>> (How would userspace specify it? O_DIRECT when opening the tap?)
>
> TBH, I hadn't thought that far. Tap already has those IFF_NO_PI etc flags, but it might make sense to just be the default. From userspace's POV it's not a semantic change.
>
> OK, just tested: I can get 230,000 packets (28 byte UDP) through the tun device in a second (130,000 actually out the 100-base-T NIC, 100,000 dropped). If the tun driver's write() blocks until the skb is destroyed, it's 4,000 packets.
>
> So your intuition was right: skb_free latency on xmit (at least for this e1000) is far too large for anything but an async solution.
>
> Will ponder further.

I think aio_write (but done copylessly) is the way to go. Not only is the infrastructure there, but the API already allows for multiple packet submission and for batching completions. Fitting into that framework ought to be easier than starting yet another one.

It still misses scatter/gather and integration with fd-based notification, but there are patches around for that.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Rusty Russell wrote:
> On Wed, 2007-04-11 at 07:26 +0300, Avi Kivity wrote:
>> Nope. Being async is critical for copyless networking:
>>
>> - in the transmit path, so need to stop the sender (guest) from touching the memory until it's on the wire. This means 100% of packets sent will be blocked.
>
> Hi Avi,
>
> You keep saying stuff like this, and I keep ignoring it. OK, I'll bite:
>
> Why would we try to prevent the sender from altering the packets?

To avoid data corruption.

The guest wants to send a packet. It calls write(), which causes an skb to be allocated, data to be copied into it, the entire networking stack gets into gear, and the guest-side driver instructs the device to send the packet.

With async operations, the saga continues like this: the host-side driver allocates an skb, get_page()s and attaches the data to the new skb, this skb crosses the bridge, trickles into the real ethernet device, gets queued there, sent, interrupts fire, triggering async completion. On this completion, we send a virtual interrupt to the guest, which tells it to destroy the skb and reclaim the pages attached to it.

Without async operations, we don't have a hook to notify the guest when to reclaim the skb. If we do it too soon, the skb can be reclaimed and the memory reused before the real device gets to see it, so we end up sending data that we did not intend. The only way to avoid it is to copy the data somewhere safe, but that is exactly what we don't want to do.

>> - multiple packets per operation (for interrupt mitigation) (like lio_listio)
>
> The benefits for interrupt mitigation are less clear to me in a virtual environment (scheduling tends to make it happen anyway); I'd want to benchmark it.

Yes, the guest will probably submit multiple packets in one hypercall. It would be nice for the userspace driver to be able to submit them to the host kernel in one syscall.

> Some kind of batching to reduce syscall overhead, perhaps, but TSO would go a fair way towards that anyway (probably not enough).

For some workloads, sure.

>> - scatter/gather packets (iovecs)
>
> Yes, and this is already present in the tap device. Anthony suggested a slightly nasty hack for multiple sg packets in one writev()/readv, which could also give us batching.

No need for hacks if we get list aio support one day.

>> - configurable wakeup (by packet count/timeout) for queue management
>
> I'm not convinced that this is a showstopper, though.

It probably isn't. It's free with aio though.

>> - hacks (tso)
>
> I'd usually go for a batch interface over TSO, but if the card we're sending to actually does TSO then TSO will probably win.

Sure, if tso helps a regular host then it should help one that happens to be running a virtual machine.

>> Most of these can be provided by a combination of the pending aio work, the pending aio/fd integration, and the not-so-pending tap aio work. As the first two are available as patches and the third is limited to the tap device, it is not unreasonable to try it out. Maybe it will turn out not to be as difficult as I predicted just a few lines above.
>
> Indeed, I don't think we're asking for a revolution a-la VJ-style channels. But I'm still itching to get back to that, and this might yet provide an excuse 8)

I'll be happy if this can be made to work. It will make the paravirt guest-side driver work in kvm-less setups, which are useful for testing, and of course reduction in kernel code is beneficial. It will be slower than in-kernel, but if we get the batching right, perhaps not significantly slower. I'm mostly concerned that this depends on code that has eluded merging for such a long time.

--
error compiling committee.c: too many arguments to function
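The corruption scenario described above (reclaiming the buffer before the device has read it) can be illustrated with a toy model. This is a sketch only: `ToyNic`, `submit` and `transmit_all` are hypothetical names invented for illustration, not kernel or kvm interfaces, and a Python `bytearray` stands in for guest pages referenced without copying.

```python
from collections import deque


class ToyNic:
    """Hypothetical stand-in for the host device.

    It queues *references* to sender buffers (no copy) and only reads
    the bytes at transmit time, like an skb pointing at guest pages.
    """

    def __init__(self):
        self.queue = deque()
        self.wire = []  # what actually went out

    def submit(self, buf, on_complete):
        self.queue.append((buf, on_complete))

    def transmit_all(self):
        while self.queue:
            buf, on_complete = self.queue.popleft()
            self.wire.append(bytes(buf))  # data is read only now
            on_complete(buf)              # async completion notification


def send_without_completion(nic):
    buf = bytearray(b"packet-1")
    nic.submit(buf, lambda _buf: None)
    buf[:] = b"REUSED!!"   # sender reclaims the memory too early
    nic.transmit_all()
    return nic.wire


def send_with_completion(nic):
    buf = bytearray(b"packet-1")
    completed = []
    nic.submit(buf, completed.append)
    nic.transmit_all()
    for done in completed:  # reclaim only after the completion fired
        done[:] = b"REUSED!!"
    return nic.wire


corrupted = send_without_completion(ToyNic())
intact = send_with_completion(ToyNic())
print(corrupted, intact)
```

Without the completion hook the wire carries the reused bytes; with it, the original packet goes out and the buffer is recycled afterwards.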
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Wed, 2007-04-11 at 17:28 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
>> On Wed, 2007-04-11 at 07:26 +0300, Avi Kivity wrote:
>>> Nope. Being async is critical for copyless networking:
>
> With async operations, the saga continues like this: the host-side driver allocates an skb, get_page()s and attaches the data to the new skb, this skb crosses the bridge, trickles into the real ethernet device, gets queued there, sent, interrupts fire, triggering async completion. On this completion, we send a virtual interrupt to the guest, which tells it to destroy the skb and reclaim the pages attached to it.

Hi Avi!

Thanks for spelling it out, I now understand your POV. I had considered it obvious that a (non-async) write which didn't copy would block until the skb was finished with, which is easy to code up within the tap device itself. Otherwise it's actually an async write without a notification mechanism, which I agree is broken.

Note though: if the guest can change the packet headers they can subvert some firewall rules and possibly crash the host. None of the networking code I wrote expects packets to change in flight 8( This applies to a userspace or kernelspace driver.

>> Yes, and this is already present in the tap device. Anthony suggested a slightly nasty hack for multiple sg packets in one writev()/readv, which could also give us batching.
>
> No need for hacks if we get list aio support one day.

As you point out though, aio is not something we want to hold our breath for. Plus, aio never makes things simpler, and complexity kills puppies.

Cheers!
Rusty.
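The blocking variant mentioned above (a non-copying write that sleeps until the skb is finished with, woken by its destructor) can be modelled in userspace. This is a toy sketch under stated assumptions: `ToySkb`, `blocking_write` and `toy_xmit` are hypothetical names, a `threading.Event` plays the role of the wait queue, and a timer stands in for the device's transmit-complete interrupt.

```python
import threading


class ToySkb:
    """Hypothetical packet buffer with a destructor callback."""

    def __init__(self, data, destructor):
        self.data = data
        self.destructor = destructor

    def free(self):
        self.destructor(self)


def blocking_write(xmit, data):
    """Hand the buffer to xmit() without copying, then sleep until it is freed."""
    done = threading.Event()
    skb = ToySkb(data, lambda _skb: done.set())  # destructor wakes the writer
    xmit(skb)
    done.wait()
    return len(data)


events = []


def toy_xmit(skb):
    """Pretend device: the packet is queued now and freed a little later."""
    events.append("queued")

    def complete():
        events.append("freed")
        skb.free()

    # Assumption: 10 ms stands in for the device's completion latency.
    threading.Timer(0.01, complete).start()


written = blocking_write(toy_xmit, b"x" * 28)
events.append("write returned")
print(written, events)
```

The writer cannot touch the buffer again until the destructor has run, which gives safety without a copy, at the cost of the per-packet wait measured elsewhere in this thread.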
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Mon, Apr 09, 2007 at 04:38:18PM +0300, Avi Kivity ([EMAIL PROTECTED]) wrote:
>> But I don't get this we can enhance the kernel but not userspace vibe 8(
>
> I've been waiting for network aio since ~2003. If it arrives in the next few days, I'm all for it; much more than kvm can use it profitably. But I'm not going to write that interface myself.

Hmm, you missed at least two implementations of network aio in the previous year, and now with syslets we can have third one.

But it looks from this discussion, that it will not prevent from changing in-kernel driver - place a hook into skb allocation path and allocate data from opposing memory - get pages from another side and put them into fragments, then copy headers into skb->data.

--
Evgeniy Polyakov
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Evgeniy Polyakov wrote:
> On Mon, Apr 09, 2007 at 04:38:18PM +0300, Avi Kivity ([EMAIL PROTECTED]) wrote:
>>> But I don't get this we can enhance the kernel but not userspace vibe 8(
>>
>> I've been waiting for network aio since ~2003. If it arrives in the next few days, I'm all for it; much more than kvm can use it profitably. But I'm not going to write that interface myself.
>
> Hmm, you missed at least two implementations of network aio in the previous year, and now with syslets we can have third one.

I meant, network aio in the mainline kernel. I am aware of the various out-of-tree implementations.

> But it looks from this discussion, that it will not prevent from changing in-kernel driver - place a hook into skb allocation path and allocate data from opposing memory - get pages from another side and put them into fragments, then copy headers into skb->data.

I don't understand this (opposing memory, another side?). Can you elaborate?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Tue, Apr 10, 2007 at 11:19:52AM +0300, Avi Kivity ([EMAIL PROTECTED]) wrote:
> I meant, network aio in the mainline kernel. I am aware of the various out-of-tree implementations.

If potential users do not pay attention to initial implementation, it is quite hard for them to get in. But actually it does not matter to this discussion.

>> But it looks from this discussion, that it will not prevent from changing in-kernel driver - place a hook into skb allocation path and allocate data from opposing memory - get pages from another side and put them into fragments, then copy headers into skb->data.
>
> I don't understand this (opposing memory, another side?). Can you elaborate?

You want to implement zero-copy network device between host and guest, if I understood this thread correctly? So, for sending part, device allocates pages from receiver's memory (or from shared memory), receiver gets an 'interrupt' and got pages from own memory, which are attached to new skb and transferred up to the network stack. It can be extended to use shared ring of pages.

--
Evgeniy Polyakov
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Evgeniy Polyakov wrote:
>>> But it looks from this discussion, that it will not prevent from changing in-kernel driver - place a hook into skb allocation path and allocate data from opposing memory - get pages from another side and put them into fragments, then copy headers into skb->data.
>>
>> I don't understand this (opposing memory, another side?). Can you elaborate?
>
> You want to implement zero-copy network device between host and guest, if I understood this thread correctly? So, for sending part, device allocates pages from receiver's memory (or from shared memory), receiver gets an 'interrupt' and got pages from own memory, which are attached to new skb and transferred up to the network stack. It can be extended to use shared ring of pages.

This is what Xen does. It is actually less performant than copying, IIRC.

The problem with flipping pages around is that physical addresses are cached both in the kvm mmu and in the on-chip tlbs, necessitating expensive page table walks and tlb invalidation IPIs.

Note that sending from the guest to an external host can be done copylessly, and for the receive side using a dma engine (like I/OAT) can reduce the cost of the copy.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Tue, Apr 10, 2007 at 02:21:24PM +0300, Avi Kivity ([EMAIL PROTECTED]) wrote:
>> You want to implement zero-copy network device between host and guest, if I understood this thread correctly? So, for sending part, device allocates pages from receiver's memory (or from shared memory), receiver gets an 'interrupt' and got pages from own memory, which are attached to new skb and transferred up to the network stack. It can be extended to use shared ring of pages.
>
> This is what Xen does. It is actually less performant than copying, IIRC.
>
> The problem with flipping pages around is that physical addresses are cached both in the kvm mmu and in the on-chip tlbs, necessitating expensive page table walks and tlb invalidation IPIs.

Hmm, I'm not familiar with Xen driver, but similar technique was used with zero-copy network sniffer some time ago, substituting userspace pages with pages containing skb data was about 25-50% faster than copying 1500 bytes in general, and in order of 10 times faster in some cases.

Check a link please in case we are talking about different ideas:
http://marc.info/?l=linux-netdev&m=112262743505711&w=2

--
Evgeniy Polyakov
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Evgeniy Polyakov wrote:
>> This is what Xen does. It is actually less performant than copying, IIRC.
>>
>> The problem with flipping pages around is that physical addresses are cached both in the kvm mmu and in the on-chip tlbs, necessitating expensive page table walks and tlb invalidation IPIs.
>
> Hmm, I'm not familiar with Xen driver, but similar technique was used with zero-copy network sniffer some time ago, substituting userspace pages with pages containing skb data was about 25-50% faster than copying 1500 bytes in general, and in order of 10 times faster in some cases.
>
> Check a link please in case we are talking about different ideas:
> http://marc.info/?l=linux-netdev&m=112262743505711&w=2

I don't really understand what you're testing there. In particular, how can the copying time change so dramatically depending on whether you've just rebooted or not?

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Tue, Apr 10, 2007 at 03:17:45PM +0300, Avi Kivity ([EMAIL PROTECTED]) wrote:
>> Check a link please in case we are talking about different ideas:
>> http://marc.info/?l=linux-netdev&m=112262743505711&w=2
>
> I don't really understand what you're testing there. In particular, how can the copying time change so dramatically depending on whether you've just rebooted or not?

I tested page remapping time - i.e. time to replace a page in two different mappings - the same should be performed in host and guest kernels if such design is going to be used for communication.

I can only explain after-reboot slow copy with empty caches - arbitrary kernel pages were copied into buffer (not the same data as in posted code).

--
Evgeniy Polyakov
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Evgeniy Polyakov wrote:
> On Tue, Apr 10, 2007 at 03:17:45PM +0300, Avi Kivity ([EMAIL PROTECTED]) wrote:
>>> Check a link please in case we are talking about different ideas:
>>> http://marc.info/?l=linux-netdev&m=112262743505711&w=2
>>
>> I don't really understand what you're testing there. In particular, how can the copying time change so dramatically depending on whether you've just rebooted or not?
>
> I tested page remapping time - i.e. time to replace a page in two different mappings - the same should be performed in host and guest kernels if such design is going to be used for communication.
>
> I can only explain after-reboot slow copy with empty caches - arbitrary kernel pages were copied into buffer (not the same data as in posted code).

Doing this in kvm would be significantly more complex, as we'd need to use full reverse mapping to locate all guest mappings (we already reverse map writable pages for other reasons), so the 25-50% difference might be nullified or even turn into overhead.

Here are the Xen numbers for reference. Xen probably has more overhead than kvm for such things, though, as it needs to do hypercalls from dom0 which is in-kernel for kvm.

http://lists.xensource.com/archives/html/xen-devel/2007-03/msg01218.html

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Mon, 2007-04-09 at 16:38 +0300, Avi Kivity wrote:
> Moreover, some things just don't lend themselves to a userspace abstraction. If we want to expose tso (tcp segmentation offload), we can easily do so with a kernel driver since the kernel interfaces are all tso aware. Tacking on tso awareness to tun/tap is doable, but at the very least weird.

It is kinda weird, yes, but it certainly makes sense. All the arguments for tso apply in triplicate to userspace packet sends...

>> We're dealing with the tun/tap device here, not a socket.
>
> Hmm. tun actually has aio_write implemented, but it seems synchronous. So does the read path.
>
> If these are made truly asynchronous, and the write path is made in addition copyless, then we might have something workable. I still cringe at having a pagetable walk in order to deliver a 1500-byte packet.

Right, now we're talking! However, it's not clear to me why creating an skb which references a kvm guest's memory doesn't need a pagetable walk, but a packet in (other) userspace memory does?

My conviction which started this discussion is that if we can offer an efficient interface for kvm, we should be able to offer an efficient interface for any (other) userspace.

As to async, I'm not *so* worried about that for the moment, although it would probably be nicer to fail than to block. Otherwise we could simply set an skb destructor to wake us up.

>> Again, sendfile is a *much* harder problem than sending a single packet once, which is the question here.
>
> sendfile() is a *different* problem. It doesn't need completion because the data is assumed not to change under it.

Well, let's not argue over that, it's irrelevant. Hopefully we can do that over a beer or equivalent sometime.

I think the first step is to see how much worse a decent userspace net driver is compared with the current in-kernel one.

Rusty.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Rusty Russell wrote:
> On Mon, 2007-04-09 at 16:38 +0300, Avi Kivity wrote:
>> Moreover, some things just don't lend themselves to a userspace abstraction. If we want to expose tso (tcp segmentation offload), we can easily do so with a kernel driver since the kernel interfaces are all tso aware. Tacking on tso awareness to tun/tap is doable, but at the very least weird.
>
> It is kinda weird, yes, but it certainly makes sense. All the arguments for tso apply in triplicate to userspace packet sends...

Well, write() with a large buffer is a sort of tso device. The problem is tso breaks through several layers (like I'm advocating in the other thread :), pushing tcp functionality into ethernet. Well, we've seen worse.

>>> We're dealing with the tun/tap device here, not a socket.
>>
>> Hmm. tun actually has aio_write implemented, but it seems synchronous. So does the read path.
>>
>> If these are made truly asynchronous, and the write path is made in addition copyless, then we might have something workable. I still cringe at having a pagetable walk in order to deliver a 1500-byte packet.
>
> Right, now we're talking! However, it's not clear to me why creating an skb which references a kvm guest's memory doesn't need a pagetable walk, but a packet in (other) userspace memory does?

Currently guest pages are stashed in a kernel array, as well as being mmap()ed into user space. That's not a very strong argument though, as I'd like to be able to map userspace memory into the guest, or map address_spaces to the guest, or something, so accessing guest physical memory will become more expensive in time.

> My conviction which started this discussion is that if we can offer an efficient interface for kvm, we should be able to offer an efficient interface for any (other) userspace.

Fully agreed. It's mostly a question of who and when. Designing and implementing this interface is going to be difficult, require deep knowledge of Linux networking, and consume a lot of time.

> As to async, I'm not *so* worried about that for the moment, although it would probably be nicer to fail than to block. Otherwise we could simply set an skb destructor to wake us up.

Nope. Being async is critical for copyless networking:

- in the transmit path, so need to stop the sender (guest) from touching the memory until it's on the wire. This means 100% of packets sent will be blocked.
- in the receive path, you could separate receive notification from the single copy that must be done (like poll() + read()), but to make use of dma engines you need to provide the end address beforehand.

> I think the first step is to see how much worse a decent userspace net driver is compared with the current in-kernel one.

A userspace net interface needs to provide the following:

- true async operations
- multiple packets per operation (for interrupt mitigation) (like lio_listio)
- scatter/gather packets (iovecs)
- configurable wakeup (by packet count/timeout) for queue management
- hacks (tso)

Most of these can be provided by a combination of the pending aio work, the pending aio/fd integration, and the not-so-pending tap aio work. As the first two are available as patches and the third is limited to the tap device, it is not unreasonable to try it out. Maybe it will turn out not to be as difficult as I predicted just a few lines above.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
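Of the requirements listed above, scatter/gather is the one that existing userspace calls already express. The sketch below shows a gathered write of a separate header and payload in a single call. It is an illustration under stated assumptions: an anonymous pipe stands in for the tap file descriptor (opening a real tap device needs privileges and is not shown), `send_gathered` is a hypothetical helper name, and the 14-byte header is just a placeholder for an ethernet header.

```python
import os


def send_gathered(fd, fragments):
    """Submit one packet made of several buffers with a single writev() call."""
    return os.writev(fd, fragments)


# Assumption: a pipe replaces the tap fd so the example runs unprivileged.
read_fd, write_fd = os.pipe()

header = b"\x00" * 14      # placeholder for an ethernet header
payload = b"hello guest"   # packet data living in a different buffer

written = send_gathered(write_fd, [header, payload])
received = os.read(read_fd, 4096)

os.close(read_fd)
os.close(write_fd)

print(written, received == header + payload)
```

Note that this call is synchronous and offers no completion notification, so by the argument in this thread the kernel still has to copy the data; it covers only the iovec item on the list, not the async or batching ones.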
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Rusty Russell wrote:
> On Sun, 2007-04-08 at 08:36 +0300, Avi Kivity wrote:
>> Rusty Russell wrote:
>>> Hi Avi,
>>>
>>> I don't think you've thought about this very hard. The receive copy is completely independent of whether the packet is going to the guest via a kernel driver or via userspace, so not relevant.
>>
>> A packet received in the kernel cannot be made available to userspace in a safe manner without a copy, as it will not be aligned with page boundaries, so userspace cannot examine the packet until after one copy has occurred.
>
> Hi Avi!
>
> I'm a little puzzled by your response. Hmm... lguest's userspace network frontend does exactly as many copies as Ingo's in-host-kernel code. One from the Guest, one to the Guest.

kvm pvnet is suboptimal now. The number of copies could be reduced by two (to zero), by constructing an skb that points to guest memory. Right now, this can only be done in-kernel.

With current userspace networking interfaces, one cannot build a network device that has less than one copy on transmit, because sendmsg() *must* copy the data (as there is no completion notification). sendfilev(), even if it existed, cannot be used: it is copyless, but lacks completion notification. It is useful only on unchanging data like read-only files.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Mon, 2007-04-09 at 10:10 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
>> I'm a little puzzled by your response. Hmm... lguest's userspace network frontend does exactly as many copies as Ingo's in-host-kernel code. One from the Guest, one to the Guest.
>
> kvm pvnet is suboptimal now. The number of copies could be reduced by two (to zero), by constructing an skb that points to guest memory. Right now, this can only be done in-kernel.

Sorry, you lost me here. You mean both input and output copies can be eliminated? Or are you talking about another two copies somewhere?

But I don't get this we can enhance the kernel but not userspace vibe 8(

> With current userspace networking interfaces, one cannot build a network device that has less than one copy on transmit, because sendmsg() *must* copy the data (as there is no completion notification).

Why are you talking about sendmsg()? Perhaps this is where we're getting tangled up.

We're dealing with the tun/tap device here, not a socket.

> sendfilev(), even if it existed, cannot be used: it is copyless, but lacks completion notification. It is useful only on unchanging data like read-only files.

Again, sendfile is a *much* harder problem than sending a single packet once, which is the question here.

Rusty.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Rusty Russell wrote:
> On Mon, 2007-04-09 at 10:10 +0300, Avi Kivity wrote:
>> Rusty Russell wrote:
>>> I'm a little puzzled by your response. Hmm... lguest's userspace network frontend does exactly as many copies as Ingo's in-host-kernel code. One from the Guest, one to the Guest.
>>
>> kvm pvnet is suboptimal now. The number of copies could be reduced by two (to zero), by constructing an skb that points to guest memory. Right now, this can only be done in-kernel.
>
> Sorry, you lost me here. You mean both input and output copies can be eliminated? Or are you talking about another two copies somewhere?

On the transmit path, current kvm pvnet has two copies:

1. on the guest side, the driver copies the skb data into the shared ring
2. on the host side, the device copies the data from the ring into a newly allocated skb

Both of these copies can be eliminated with a host-side kernel. With current userspace interfaces, only one copy can be eliminated.

Similar logic applies to receive, except that one copy must remain.

> But I don't get this we can enhance the kernel but not userspace vibe 8(

I've been waiting for network aio since ~2003. If it arrives in the next few days, I'm all for it; much more than kvm can use it profitably. But I'm not going to write that interface myself.

Moreover, some things just don't lend themselves to a userspace abstraction. If we want to expose tso (tcp segmentation offload), we can easily do so with a kernel driver since the kernel interfaces are all tso aware. Tacking on tso awareness to tun/tap is doable, but at the very least weird.

>> With current userspace networking interfaces, one cannot build a network device that has less than one copy on transmit, because sendmsg() *must* copy the data (as there is no completion notification).
>
> Why are you talking about sendmsg()? Perhaps this is where we're getting tangled up.
>
> We're dealing with the tun/tap device here, not a socket.

Hmm. tun actually has aio_write implemented, but it seems synchronous. So does the read path.

If these are made truly asynchronous, and the write path is made in addition copyless, then we might have something workable. I still cringe at having a pagetable walk in order to deliver a 1500-byte packet.

>> sendfilev(), even if it existed, cannot be used: it is copyless, but lacks completion notification. It is useful only on unchanging data like read-only files.
>
> Again, sendfile is a *much* harder problem than sending a single packet once, which is the question here.

sendfile() is a *different* problem. It doesn't need completion because the data is assumed not to change under it.

Consider that the guest may be issuing a megabyte-sized sendfile() which is broken into 17 tso frames. We need to preserve the large structures as much as possible or we end up repeating the simple single packet once path 700 times.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
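The "17 tso frames" versus "700 times" figures above are stated without derivation. The sketch below is one plausible reconstruction, not something the message spells out: it assumes a 1 MiB transfer, a TSO frame bounded by the 65,535-byte IPv4 total length with 40 bytes of IPv4 and TCP headers, and a 1500-byte ethernet MTU. All constant names are illustrative.

```python
import math

SENDFILE_BYTES = 1 << 20   # assumption: "megabyte-sized" means 1 MiB

# Assumption: a TSO frame is limited by the 16-bit IPv4 total length,
# of which 20 bytes are the IPv4 header and 20 the TCP header (no options).
MAX_IP_PACKET = 65_535
HEADERS = 20 + 20
TSO_PAYLOAD = MAX_IP_PACKET - HEADERS

MTU = 1_500                # standard ethernet MTU
MTU_PAYLOAD = MTU - HEADERS

# Large frames handed down intact.
tso_frames = math.ceil(SENDFILE_BYTES / TSO_PAYLOAD)

# Rough count if everything is cut to MTU size (the message's round figure).
mtu_packets_rough = math.ceil(SENDFILE_BYTES / MTU)

# Slightly higher count once per-packet headers are subtracted.
mtu_packets_exact = math.ceil(SENDFILE_BYTES / MTU_PAYLOAD)

print(tso_frames, mtu_packets_rough, mtu_packets_exact)
```

Either way the per-packet path runs about forty times more often without TSO, which is the cost the message is pointing at.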
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Sun, Apr 08, 2007 at 08:36:14AM +0300, Avi Kivity wrote:
> That is not the common case. Nor is it true when there is a mismatch between the card's capabilities and guest expectations and constraints. For example, guest memory is not physically contiguous so a NIC that won't do scatter/gather will require bouncing (or an iommu, but that's not here yet).

Actually, Allen Key from Intel just posted the first VT-d patches to xen-devel a couple of days ago. I wonder if anyone is working on kvm support (which would require Linux support).

Cheers,
Muli
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Sun, 2007-04-08 at 08:36 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
>> Hi Avi,
>>
>> I don't think you've thought about this very hard. The receive copy is completely independent of whether the packet is going to the guest via a kernel driver or via userspace, so not relevant.
>
> A packet received in the kernel cannot be made available to userspace in a safe manner without a copy, as it will not be aligned with page boundaries, so userspace cannot examine the packet until after one copy has occurred.

Hi Avi!

I'm a little puzzled by your response. Hmm... lguest's userspace network frontend does exactly as many copies as Ingo's in-host-kernel code. One from the Guest, one to the Guest.

Does that clarify?
Rusty.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Rusty Russell wrote: On Thu, 2007-04-05 at 10:17 +0300, Avi Kivity wrote: Rusty Russell wrote: You didn't quote Anthony's point about it's more about there not being good enough userspace interfaces to do network IO. It's easier to write a kernel-space network driver, but it's not obviously the right thing to do until we can show that an efficient packet-level userspace interface isn't possible. I don't think that's been done, and it would be interesting to try. In the case of networking, the copyful interfaces on receive are driven by the hardware not knowing how to split the header from the data. On transmit I agree, it could be made copyless from userspace (something like sendfilev, only not file oriented). Hi Avi, I don't think you've thought about this very hard. The receive copy is completely independent of whether the packet is going to the guest via a kernel driver or via userspace, so not relevant. A packet received in the kernel cannot be made available to userspace in a safe manner without a copy, as it will not be aligned with page boundaries, so userspace cannot examine the packet until after one copy has occurred. After userspace has determined what to do with the packet, another copy must take place to get it there. There's a counterexample, mmapped sockets, but that works only when all packets arriving on a card are exposed to the same process. This is useful for tcpdump or for what you outline below but is hardly generic. And if all packets from the card are going to the guest, you can deliver directly. Userspace or kernel, no difference. That is not the common case. Nor is it true when there is a mismatch between the card's capabilities and guest expectations and constraints. For example, guest memory is not physically contiguous so a NIC that won't do scatter/gather will require bouncing (or an iommu, but that's not here yet). And we have a sendfilev not file oriented: it's called writev 8) writev() cannot be made copyless for networking. 
One needs an async interface so the kernel can complete the write after the NIC acks the DMA transfer, or a kernel driver. An in-kernel driver can avoid system call overhead and page references. But a better tap device helps more than just KVM. I'll believe it when I see it. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
* Ingo Molnar [EMAIL PROTECTED] wrote: * Anthony Liguori [EMAIL PROTECTED] wrote: [...] Did Linux have extremely high quality code in 1994? yes! It was crucial to strive for extremely high quality code all the time. That was the only way to grow Linux's codebase, which was ~300,000 lines of code in 1994, to the current 7.2+ million lines of code, without losing maintainability. [...] in fact Linux 1.0, released in early 1994, was only 170,000 LOC: http://www.kernel.org/pub/linux/kernel/v1.0/linux-1.0.tar.gz and i just looked at a few random files in it - it's pretty clean. Ingo
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
* Rusty Russell [EMAIL PROTECTED] wrote: prototyping new kernel APIs to implement user-space network drivers, on a crufty codebase is not something that should be done lightly. I think you overestimate my radicalism. I was considering readv() and writev() on the tap device. ok :-) How would packeting be handled, or would this be like a raw socket in essence, but not in 'peek' but 'filter through' mode? I think it's not quite trivial. (but maybe i'm way too radical again :) Ingo
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
* Avi Kivity [EMAIL PROTECTED] wrote: [...] But the difference in cruftiness between kvm and qemu code should not enter into the discussion of where to do things. i agree that it doesnt enter the discussion for the *PIC question, but it very much enters the discussion for the question that i replied to: You didn't quote Anthony's point about it's more about there not being good enough userspace interfaces to do network IO. It's easier to write a kernel-space network driver, but it's not obviously the right thing to do until we can show that an efficient packet-level userspace interface isn't possible. I don't think that's been done, and it would be interesting to try. prototyping new kernel APIs to implement user-space network drivers, on a crufty codebase is not something that should be done lightly. Any negative result will not bring us any real conclusion. (was the failure due to the concept, due to the API or due to the crufty codebase?) (but ... this is really a side-track issue for the *PIC question at hand. PICs are not network devices, they are essential platform components and almost an extended part of the CPU.) Ingo
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Ingo Molnar wrote: so right now the only option for a clean codebase is the KVM in-kernel code. I strongly disagree with this. Bad code in userspace is not an excuse for shoving stuff into the kernel, where maintaining it is much more expensive, and the cause of a mistake can be system crashes and data loss, affecting unrelated processes. If we move something into the kernel, we'd better have a really good reason for it. Qemu code _is_ crufty. We can do one of three things:
1. live with it
2. fork it and clean it up
3. clean it up incrementally and merge it upstream
Currently we're doing (1). You're suggesting a variant of (2), fork plus move into the kernel. The right thing to do IMO is (3), but I don't see anybody volunteering. Qemu picked up additional committers recently and I believe they would be receptive to cleanups. [In the *pic/pit case, we have other reasons to push things into the kernel. But "this code is crap, let's rewrite it in the kernel" is not a justification I'll accept. I'd be much happier if we could quantify these other reasons.] -- error compiling committee.c: too many arguments to function
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
* Ingo Molnar [EMAIL PROTECTED] wrote: * Rusty Russell [EMAIL PROTECTED] wrote: It's easier to write a kernel-space network driver, but it's not obviously the right thing to do until we can show that an efficient packet-level userspace interface isn't possible. I don't think that's been done, and it would be interesting to try. yes, i agree in theory, [...] let me explain my position a bit more verbosely: i agree in terms of 'network driver' (and more generally in terms of 'device', which includes network, storage, console, etc. devices): having a user-space driver option should still be possible and it should be integrated well. Qemu is quite rich and flexible in these areas and we dont want to throw away or isolate that body of code. but i dont agree in terms of PIC code, which is the main argument in this particular thread. There's little precedent for any add-ons for PICs in user-space, nor any particular PIC handling richness in Qemu that we'd like to preserve. Ingo
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Ingo Molnar wrote: * Avi Kivity [EMAIL PROTECTED] wrote: so right now the only option for a clean codebase is the KVM in-kernel code. I strongly disagree with this. are you disagreeing with my statement that the KVM kernel-side code is the only clean codebase here? To me this is a clear fact :) No, I agree with that. I just disagree with choosing to put the *pic code (or other code) into the kernel on *that* basis. The selection should be on design/performance issues alone, *not* the state of existing code. I only pointed out that the only clean codebase at the moment is the KVM in-kernel code - i did not make the argument (at all) that every new piece of KVM code should be done in the kernel. That would be stupid - do you think i'd advocate for example moving command line argument parsing into the kernel? No. But the difference in cruftiness between kvm and qemu code should not enter into the discussion of where to do things. and as i said in the mail: the kernel _is_ the best place to do this particular stuff. I agree with this, maybe for different reasons. -- error compiling committee.c: too many arguments to function
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
* Avi Kivity [EMAIL PROTECTED] wrote: so right now the only option for a clean codebase is the KVM in-kernel code. I strongly disagree with this. are you disagreeing with my statement that the KVM kernel-side code is the only clean codebase here? To me this is a clear fact :) I only pointed out that the only clean codebase at the moment is the KVM in-kernel code - i did not make the argument (at all) that every new piece of KVM code should be done in the kernel. That would be stupid - do you think i'd advocate for example moving command line argument parsing into the kernel? and as i said in the mail: the kernel _is_ the best place to do this particular stuff. Ingo
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
* Rusty Russell [EMAIL PROTECTED] wrote: It's easier to write a kernel-space network driver, but it's not obviously the right thing to do until we can show that an efficient packet-level userspace interface isn't possible. I don't think that's been done, and it would be interesting to try. yes, i agree in theory, but IMO this is largely beside the point. What matters most for developing a project is _the quality of the codebase_. That attracts developers, developers improve the code, which then attracts users, which attracts more developers, etc., etc. As long as the quality of the codebase is maintained, this is a self-sustaining process. You've seen that happen with Linux. [ And of course, the crucial step #0 is: a sane, open-minded maintainer with good taste ;-) ] qemu's code quality is not really suitable for that basic OSS model, in my opinion. It has been a mostly one-man show for a long time with various hostile forks, bin-only kernel module and other actions that easily poison an OSS project. the result is not surprising: important portions of qemu have grown into a hard to hack, hard to maintain codebase with poor code quality, with gems like:
  #ifdef _WIN32
  void CALLBACK host_alarm_handler(UINT uTimerID, UINT uMsg, DWORD_PTR dwUser, DWORD_PTR dw1, DWORD_PTR dw2)
  #else
  static void host_alarm_handler(int host_signum)
  #endif
  {
  #if 0
  #define DISP_FREQ 1000
and that's not just some random driver - this is _the_ main central timer code of qemu. so right now the only option for a clean codebase is the KVM in-kernel code. It's clean and sweet and integrates nicely into the rest of the kernel. The kernel is also obviously the final place where most virtualization technologies want to show up because it's the entity that is the closest to the guest context: we _dont_ want to _force_ network traffic (let alone interrupt handling) through a userspace context, only if the functionality of the task absolutely requires it. 
(but in most cases we'll try to come up with a maximally flexible scheme that can just drive things straight via the kernel. netfilter/iptables isnt in user-space either, partly for that reason.) but architectural issues aside (ignoring that the kernel _is_ the best place to do this particular stuff), this question is still mainly dominated by the basic question of code quality. I'd rather move something into the Linux kernel, enforce its code quality that way, and _then_ add whatever clean infrastructure is needed to push it back into user-space again (into a different codebase), than having to hack the monolithic 200 KLOC+ qemu codebase that is shackled with support for tons of arcane architectures nobody uses and tons of arcane OS variants that no-one cares about. Now qemu is a very important enabler and platform-reference-implementation for KVM to fall back to, but it's not the place to put crucial new code into, at least currently. Ingo
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Rusty Russell wrote: You didn't quote Anthony's point about it's more about there not being good enough userspace interfaces to do network IO. It's easier to write a kernel-space network driver, but it's not obviously the right thing to do until we can show that an efficient packet-level userspace interface isn't possible. I don't think that's been done, and it would be interesting to try. In the case of networking, the copyful interfaces on receive are driven by the hardware not knowing how to split the header from the data. On transmit I agree, it could be made copyless from userspace (something like sendfilev, only not file oriented). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
Ingo Molnar wrote: * Rusty Russell [EMAIL PROTECTED] wrote: It's easier to write a kernel-space network driver, but it's not obviously the right thing to do until we can show that an efficient packet-level userspace interface isn't possible. I don't think that's been done, and it would be interesting to try. yes, i agree in theory, but IMO this is largely beside the point. What matters most for developing a project is _the quality of the codebase_. That attracts developers, developers improve the code, which then attracts users, which attracts more developers, etc., etc. As long as the quality of the codebase is maintained, this is a self-sustaining process. You've seen that happen with Linux. [ And of course, the crucial step #0 is: a sane, open-minded maintainer with good taste ;-) ] qemu's code quality is not really suitable for that basic OSS model, in my opinion. I think you may want to step off your high horse there. QEMU's code may not be Linux kernel quality but it's certainly not anywhere near the worst that is out there. Linux is over a decade old. QEMU is only around 3 years old. Did Linux have extremely high quality code in 1994? Instead of posting code snippets to LKML, it would be much more constructive to post patches to qemu-devel. It's not like the QEMU maintainers are actively ignoring your efforts to improve the code. but architectural issues aside (ignoring that the kernel _is_ the best place to do this particular stuff), Right. We don't put things in the kernel just because we don't like the way the userspace code is written. If that logic was valid, then Linus would be working on moving all of Gnome into the kernel. This discussion has two parts. The first is whether or not the kernel is the right place for a paravirtual network driver backend. My current belief is that we could not get enough performance from something like tun to do it in userspace. 
I also believe that we could improve tun (or create a replacement) so that we could implement a PV network driver backend in userspace. Admittedly, I'm not an expert in networking though so I could be wrong here. The second part is whether the platform devices should go in the kernel. I agree with you that having the PIT in the kernel is probably a good idea. I also agree that we probably have no choice but to move the APIC into the kernel (not for PV drivers, but for TPR performance and SMP support). Regards, Anthony Liguori this question is still mainly dominated by the basic question of code quality. I'd rather move something into the Linux kernel, enforce its code quality that way, and _then_ add whatever clean infrastructure is needed to push it back into user-space again (into a different codebase), than having to hack the monolithic 200 KLOC+ qemu codebase that is shackled with support for tons of arcane architectures nobody uses and tons of arcane OS variants that no-one cares about. Now qemu is a very important enabler and platform-reference-implementation for KVM to fall back to, but it's not the place to put crucial new code into, at least currently. Ingo
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Thu, 2007-04-05 at 13:36 +0200, Ingo Molnar wrote: prototyping new kernel APIs to implement user-space network drivers, on a crufty codebase is not something that should be done lightly. I think you overestimate my radicalism. I was considering readv() and writev() on the tap device. Qemu's infrastructure may hurt kvm here, but lguest won't be able to use that excuse. track issue for the *PIC question at hand. PICs are not network devices, they are essential platform components and almost an extended part of the CPU.) Definitely, I'm only interested in stealing^H^H^Hsharing KVM devices. The subject is now deeply misleading 8( Cheers, Rusty.
Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
On Wed, 2007-04-04 at 23:21 +0200, Ingo Molnar wrote: * Anthony Liguori [EMAIL PROTECTED] wrote: But why is it a good thing to do PV drivers in the kernel? You lose flexibility and functionality to gain performance. [...] in Linux a kernel-space network driver can still be tunneled over user-space code, and hence you can add arbitrary add-on functionality (and thus have flexibility), without slowing down the common case (which would be to tunnel the guest's network traffic into the firewall rules of the kernel. No need to touch user-space for any of that). You didn't quote Anthony's point about it's more about there not being good enough userspace interfaces to do network IO. It's easier to write a kernel-space network driver, but it's not obviously the right thing to do until we can show that an efficient packet-level userspace interface isn't possible. I don't think that's been done, and it would be interesting to try. Cheers, Rusty.