Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 10/01/2009 09:24 PM, Gregory Haskins wrote:
>> Virtualization is about not doing that. Sometimes it's necessary (when
>> you have made unfixable design mistakes), but just to replace a bus,
>> with no advantages to the guest that has to be changed (other
>> hypervisors or hypervisorless deployment scenarios aren't).
>
> The problem is that your continued assertion that there is no advantage
> to the guest is a completely unsubstantiated claim. As it stands right
> now, I have a public git tree that, to my knowledge, is the fastest KVM
> PV networking implementation around. It also has capabilities that are
> demonstrably not found elsewhere, such as the ability to render generic
> shared-memory interconnects (scheduling, timers), interrupt-priority
> (QoS), and interrupt-coalescing (exit-ratio reduction). I designed each
> of these capabilities after carefully analyzing where KVM was coming up
> short.
>
> Those are facts.
>
> I can't easily prove which of my new features alone are what makes it
> special per se, because I don't have unit tests for each part that
> break it down. What I _can_ state is that it's the fastest and most
> feature-rich KVM-PV tree that I am aware of, and others may download and
> test it themselves to verify my claims.

If you wish to introduce a feature which has downsides (and to me, vbus
has downsides) then you must prove it is necessary on its own merits.
venet is pretty cool, but I need proof before I believe its performance
is due to vbus and not to venet-host.

> The disproof, on the other hand, would be a counter-example that
> still meets all the performance and feature criteria under all the same
> conditions while maintaining the existing ABI. To my knowledge, this
> doesn't exist.

mst is working on it and we should have it soon.

> Therefore, if you believe my work is irrelevant, show me a git tree that
> accomplishes the same feats in a binary-compatible way, and I'll rethink
> my position. Until then, complaining about lack of binary compatibility
> is pointless since it is not an insurmountable proposition, and the one
> and only available solution declares it a required casualty.

Fine, let's defer it until vhost-net is up and running.

>> Well, Xen requires pre-translation (since the guest has to give the host
>> (which is just another guest) permissions to access the data).
>
> Actually I am not sure that it does require pre-translation. You might
> be able to use the memctx->copy_to/copy_from scheme in post-translation
> as well, since those would be able to communicate to something like the
> Xen kernel. But I suppose either method would result in extra exits, so
> there is no distinct benefit to using vbus there... as you say below,
> "they're just different".
>
> The biggest difference is that my proposed model gets around the notion
> that the entire guest address space can be represented by an arbitrary
> pointer. For instance, the copy_to/copy_from routines take a GPA, but
> may use something indirect like a DMA controller to access that GPA. On
> the other hand, virtio fully expects a viable pointer to come out of
> the interface, IIUC. This is in part what makes vbus more adaptable to
> non-virt.

No, virtio doesn't expect a pointer (this is what makes Xen possible).
vhost does; but nothing prevents an interested party from adapting it.

>>> An interesting thing here is that you don't even need a fancy
>>> multi-homed setup to see the effects of my exit-ratio reduction work:
>>> even single-port configurations suffer from the phenomenon, since many
>>> devices have multiple signal-flows (e.g. network adapters tend to have
>>> at least 3 flows: rx-ready, tx-complete, and control-events (link-state,
>>> etc.)). What's worse is that the flows are often indirectly related (for
>>> instance, many host adapters will free tx skbs during rx operations, so
>>> you tend to get bursts of tx-completes at the same time as rx-ready). If
>>> the flows map 1:1 with the IDT, they will suffer the same problem.
>>
>> You can simply use the same vector for both rx and tx and poll both at
>> every interrupt.
>
> Yes, but that has its own problems: e.g. additional exits, or at least
> additional overhead figuring out what happened each time.

If you're just coalescing tx and rx, it's an additional memory read
(which you have anyway in the vbus interrupt queue).

> This is even more important as we scale out to MQ, which may have
> dozens of queue pairs. You really want finer-grained signal-path decode
> if you want peak performance.

MQ definitely wants per-queue or per-queue-pair vectors, and it
definitely doesn't want all interrupts to be serviced by a single
interrupt queue (you could/should make the queue per-vcpu).

>>> It's important to note here that we are actually looking at the interrupt
>>> rate, not the exit rate (which is usually a multiple of the interrupt
>>> rate, since you have to
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/30/2009 10:04 PM, Gregory Haskins wrote: > > >>> A 2.6.27 guest, or Windows guest with the existing virtio drivers, >>> won't work >>> over vbus. >>> >> Binary compatibility with existing virtio drivers, while nice to have, >> is not a specific requirement nor goal. We will simply load an updated >> KMP/MSI into those guests and they will work again. As previously >> discussed, this is how more or less any system works today. It's like >> we are removing an old adapter card and adding a new one to "uprev the >> silicon". >> > > Virtualization is about not doing that. Sometimes it's necessary (when > you have made unfixable design mistakes), but just to replace a bus, > with no advantages to the guest that has to be changed (other > hypervisors or hypervisorless deployment scenarios aren't). The problem is that your continued assertion that there is no advantage to the guest is a completely unsubstantiated claim. As it stands right now, I have a public git tree that, to my knowledge, is the fastest KVM PV networking implementation around. It also has capabilities that are demonstrably not found elsewhere, such as the ability to render generic shared-memory interconnects (scheduling, timers), interrupt-priority (qos), and interrupt-coalescing (exit-ratio reduction). I designed each of these capabilities after carefully analyzing where KVM was coming up short. Those are facts. I can't easily prove which of my new features alone are what makes it special per se, because I don't have unit tests for each part that breaks it down. What I _can_ state is that its the fastest and most feature rich KVM-PV tree that I am aware of, and others may download and test it themselves to verify my claims. The disproof, on the other hand, would be in a counter example that still meets all the performance and feature criteria under all the same conditions while maintaining the existing ABI. To my knowledge, this doesn't exist. 
Therefore, if you believe my work is irrelevant, show me a git tree that accomplishes the same feats in a binary compatible way, and I'll rethink my position. Until then, complaining about lack of binary compatibility is pointless since it is not an insurmountable proposition, and the one and only available solution declares it a required casualty. > >>> Further, non-shmem virtio can't work over vbus. >>> >> Actually I misspoke earlier when I said virtio works over non-shmem. >> Thinking about it some more, both virtio and vbus fundamentally require >> shared-memory, since sharing their metadata concurrently on both sides >> is their raison d'être. >> >> The difference is that virtio utilizes a pre-translation/mapping (via >> ->add_buf) from the guest side. OTOH, vbus uses a post translation >> scheme (via memctx) from the host-side. If anything, vbus is actually >> more flexible because it doesn't assume the entire guest address space >> is directly mappable. >> >> In summary, your statement is incorrect (though it is my fault for >> putting that idea in your head). >> > > Well, Xen requires pre-translation (since the guest has to give the host > (which is just another guest) permissions to access the data). Actually I am not sure that it does require pre-translation. You might be able to use the memctx->copy_to/copy_from scheme in post translation as well, since those would be able to communicate to something like the xen kernel. But I suppose either method would result in extra exits, so there is no distinct benefit using vbus there..as you say below "they're just different". The biggest difference is that my proposed model gets around the notion that the entire guest address space can be represented by an arbitrary pointer. For instance, the copy_to/copy_from routines take a GPA, but may use something indirect like a DMA controller to access that GPA. On the other hand, virtio fully expects a viable pointer to come out of the interface iiuc. 
This is in part what makes vbus more adaptable to non-virt. > So neither is a superset of the other, they're just different. > > It doesn't really matter since Xen is unlikely to adopt virtio. Agreed. > >> An interesting thing here is that you don't even need a fancy >> multi-homed setup to see the effects of my exit-ratio reduction work: >> even single port configurations suffer from the phenomenon since many >> devices have multiple signal-flows (e.g. network adapters tend to have >> at least 3 flows: rx-ready, tx-complete, and control-events (link-state, >> etc). Whats worse, is that the flows often are indirectly related (for >> instance, many host adapters will free tx skbs during rx operations, so >> you tend to get bursts of tx-completes at the same time as rx-ready. If >> the flows map 1:1 with IDT, they will suffer the same problem. >> > > You can simply use the same vector for both rx and tx and poll both at > every interrupt. Yes, but that has its own problems: e.
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Thu, Oct 01, 2009 at 10:34:17AM +0200, Avi Kivity wrote:
>> Second, I do not use ioeventfd anymore because it has too many problems
>> with the surrounding technology. However, that is a topic for a
>> different thread.
>
> Please post your issues. I see ioeventfd/irqfd as critical kvm interfaces.

I second that. AFAIK ioeventfd/irqfd got exposed to userspace in
2.6.32-rc1; if there are issues we better nail them before 2.6.32 is out.
And yes, please start a different thread.

-- 
MST
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/30/2009 10:04 PM, Gregory Haskins wrote: >> A 2.6.27 guest, or Windows guest with the existing virtio drivers, won't work >> over vbus. >> > Binary compatibility with existing virtio drivers, while nice to have, > is not a specific requirement nor goal. We will simply load an updated > KMP/MSI into those guests and they will work again. As previously > discussed, this is how more or less any system works today. It's like > we are removing an old adapter card and adding a new one to "uprev the > silicon". > Virtualization is about not doing that. Sometimes it's necessary (when you have made unfixable design mistakes), but just to replace a bus, with no advantages to the guest that has to be changed (other hypervisors or hypervisorless deployment scenarios aren't). >> Further, non-shmem virtio can't work over vbus. >> > Actually I misspoke earlier when I said virtio works over non-shmem. > Thinking about it some more, both virtio and vbus fundamentally require > shared-memory, since sharing their metadata concurrently on both sides > is their raison d'être. > > The difference is that virtio utilizes a pre-translation/mapping (via > ->add_buf) from the guest side. OTOH, vbus uses a post translation > scheme (via memctx) from the host-side. If anything, vbus is actually > more flexible because it doesn't assume the entire guest address space > is directly mappable. > > In summary, your statement is incorrect (though it is my fault for > putting that idea in your head). > Well, Xen requires pre-translation (since the guest has to give the host (which is just another guest) permissions to access the data). So neither is a superset of the other, they're just different. It doesn't really matter since Xen is unlikely to adopt virtio. 
> An interesting thing here is that you don't even need a fancy
> multi-homed setup to see the effects of my exit-ratio reduction work:
> even single-port configurations suffer from the phenomenon, since many
> devices have multiple signal-flows (e.g. network adapters tend to have
> at least 3 flows: rx-ready, tx-complete, and control-events (link-state,
> etc.)). What's worse is that the flows are often indirectly related (for
> instance, many host adapters will free tx skbs during rx operations, so
> you tend to get bursts of tx-completes at the same time as rx-ready). If
> the flows map 1:1 with the IDT, they will suffer the same problem.

You can simply use the same vector for both rx and tx and poll both at
every interrupt.

> In any case, here is an example run of a simple single-homed guest over
> standard GigE. What's interesting here is the .qnotify to .notify
> ratio, as this is the interrupt-to-signal ratio. In this case, it's
> 151918/170047, which comes out to about 11% savings in interrupt
> injections:
>
> vbus-guest:/home/ghaskins # netperf -H dev
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> dev.laurelwood.net (192.168.1.10) port 0 AF_INET
> Recv    Send    Send
> Socket  Socket  Message  Elapsed
> Size    Size    Size     Time     Throughput
> bytes   bytes   bytes    secs.    10^6bits/sec
>
> 1048576  16384  16384    10.01    940.77
>
> vbus-guest:/home/ghaskins # cat /sys/kernel/debug/pci-to-vbus-bridge
>   .events      : 170048
>   .qnotify     : 151918
>   .qinject     : 0
>   .notify      : 170047
>   .inject      : 18238
>   .bridgecalls : 18
>   .buscalls    : 12
>
> vbus-guest:/home/ghaskins # cat /proc/interrupts
>            CPU0
>   0:         87   IO-APIC-edge      timer
>   1:          6   IO-APIC-edge      i8042
>   4:        733   IO-APIC-edge      serial
>   6:          2   IO-APIC-edge      floppy
>   7:          0   IO-APIC-edge      parport0
>   8:          0   IO-APIC-edge      rtc0
>   9:          0   IO-APIC-fasteoi   acpi
>  10:          0   IO-APIC-fasteoi   virtio1
>  12:         90   IO-APIC-edge      i8042
>  14:       3041   IO-APIC-edge      ata_piix
>  15:       1008   IO-APIC-edge      ata_piix
>  24:     151933   PCI-MSI-edge      vbus
>  25:          0   PCI-MSI-edge      virtio0-config
>  26:        190   PCI-MSI-edge      virtio0-input
>  27:         28   PCI-MSI-edge      virtio0-output
> NMI:          0   Non-maskable interrupts
> LOC:       9854   Local timer interrupts
> SPU:          0   Spurious interrupts
> CNT:          0   Performance counter interrupts
> PND:          0   Performance pending work
> RES:          0   Rescheduling interrupts
> CAL:          0   Function call interrupts
> TLB:          0   TLB shootdowns
> TRM:          0   Thermal event interrupts
> THR:          0   Threshold APIC interrupts
> MCE:          0   Machine check exceptions
> MCP:          1   Machine check polls
> ERR:          0
> MIS:          0
>
> It's important to note here that we are actually loo
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/26/2009 12:32 AM, Gregory Haskins wrote: I realize in retrospect that my choice of words above implies vbus _is_ complete, but this is not what I was saying. What I was trying to convey is that vbus is _more_ complete. Yes, in either case some kind of glue needs to be written. The difference is that vbus implements more of the glue generally, and leaves less required to be customized for each iteration. >>> >>> No argument there. Since you care about non-virt scenarios and virtio >>> doesn't, naturally vbus is a better fit for them as the code stands. >>> >> Thanks for finally starting to acknowledge there's a benefit, at least. >> > > I think I've mentioned vbus' finer grained layers as helpful here, > though I doubt the value of this. Hypervisors are added rarely, while > devices and drivers are added (and modified) much more often. I don't > buy the anything-to-anything promise. The ease in which a new hypervisor should be able to integrate into the stack is only one of vbus's many benefits. > >> To be more precise, IMO virtio is designed to be a performance oriented >> ring-based driver interface that supports all types of hypervisors (e.g. >> shmem based kvm, and non-shmem based Xen). vbus is designed to be a >> high-performance generic shared-memory interconnect (for rings or >> otherwise) framework for environments where linux is the underpinning >> "host" (physical or virtual). They are distinctly different, but >> complementary (the former addresses the part of the front-end, and >> latter addresses the back-end, and a different part of the front-end). >> > > They're not truly complementary since they're incompatible. No, that is incorrect. Not to be rude, but for clarity: Complementary \Com`ple*men"ta*ry\, a. Serving to fill out or to complete; as, complementary numbers. [1913 Webster] Citation: www.dict.org IOW: Something being complementary has nothing to do with guest/host binary compatibility. 
virtio-pci and virtio-vbus are both equally complementary to virtio since they fill in the bottom layer of the virtio stack. So yes, vbus is truly complementary to virtio afaict. > A 2.6.27 guest, or Windows guest with the existing virtio drivers, won't work > over vbus. Binary compatibility with existing virtio drivers, while nice to have, is not a specific requirement nor goal. We will simply load an updated KMP/MSI into those guests and they will work again. As previously discussed, this is how more or less any system works today. It's like we are removing an old adapter card and adding a new one to "uprev the silicon". > Further, non-shmem virtio can't work over vbus. Actually I misspoke earlier when I said virtio works over non-shmem. Thinking about it some more, both virtio and vbus fundamentally require shared-memory, since sharing their metadata concurrently on both sides is their raison d'être. The difference is that virtio utilizes a pre-translation/mapping (via ->add_buf) from the guest side. OTOH, vbus uses a post translation scheme (via memctx) from the host-side. If anything, vbus is actually more flexible because it doesn't assume the entire guest address space is directly mappable. In summary, your statement is incorrect (though it is my fault for putting that idea in your head). > Since > virtio is guest-oriented and host-agnostic, it can't ignore > non-shared-memory hosts (even though it's unlikely virtio will be > adopted there) Well, to be fair no one said it has to ignore them. Either virtio-vbus transport is present and available to the virtio stack, or it isn't. If its present, it may or may not publish objects for consumption. Providing a virtio-vbus transport in no way limits or degrades the existing capabilities of the virtio stack. It only enhances them. I digress. The whole point is moot since I realized that the non-shmem distinction isn't accurate anyway. 
They both require shared-memory for the metadata, and IIUC virtio requires the entire address space to be mappable whereas vbus only assumes the metadata is. > >> In addition, the kvm-connector used in AlacrityVM's design strives to >> add value and improve performance via other mechanisms, such as dynamic >> allocation, interrupt coalescing (thus reducing exit-ratio, which is a >> serious issue in KVM) > > Do you have measurements of inter-interrupt coalescing rates (excluding > intra-interrupt coalescing). I actually do not have a rig setup to explicitly test inter-interrupt rates at the moment. Once things stabilize for me, I will try to re-gather some numbers here. Last time I looked, however, there were some decent savings for inter as well. Inter rates are interesting because they are what tends to ramp up with IO load more than intra since guest interrupt mitigation techniques like NAPI often quell intra-rates naturally. This is especially true for data-cent
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/26/2009 12:32 AM, Gregory Haskins wrote: >>> >>> I realize in retrospect that my choice of words above implies vbus _is_ >>> complete, but this is not what I was saying. What I was trying to >>> convey is that vbus is _more_ complete. Yes, in either case some kind >>> of glue needs to be written. The difference is that vbus implements >>> more of the glue generally, and leaves less required to be customized >>> for each iteration. >>> >>> >> >> No argument there. Since you care about non-virt scenarios and virtio >> doesn't, naturally vbus is a better fit for them as the code stands. >> > Thanks for finally starting to acknowledge there's a benefit, at least. > I think I've mentioned vbus' finer grained layers as helpful here, though I doubt the value of this. Hypervisors are added rarely, while devices and drivers are added (and modified) much more often. I don't buy the anything-to-anything promise. > To be more precise, IMO virtio is designed to be a performance oriented > ring-based driver interface that supports all types of hypervisors (e.g. > shmem based kvm, and non-shmem based Xen). vbus is designed to be a > high-performance generic shared-memory interconnect (for rings or > otherwise) framework for environments where linux is the underpinning > "host" (physical or virtual). They are distinctly different, but > complementary (the former addresses the part of the front-end, and > latter addresses the back-end, and a different part of the front-end). > They're not truly complementary since they're incompatible. A 2.6.27 guest, or Windows guest with the existing virtio drivers, won't work over vbus. Further, non-shmem virtio can't work over vbus. Since virtio is guest-oriented and host-agnostic, it can't ignore non-shared-memory hosts (even though it's unlikely virtio will be adopted there). 
> In addition, the kvm-connector used in AlacrityVM's design strives to > add value and improve performance via other mechanisms, such as dynamic > allocation, interrupt coalescing (thus reducing exit-ratio, which is a > serious issue in KVM) Do you have measurements of inter-interrupt coalescing rates (excluding intra-interrupt coalescing). > and priortizable/nestable signals. > That doesn't belong in a bus. > Today there is a large performance disparity between what a KVM guest > sees and what a native linux application sees on that same host. Just > take a look at some of my graphs between "virtio", and "native", for > example: > > http://developer.novell.com/wiki/images/b/b7/31-rc4_throughput.png > That's a red herring. The problem is not with virtio as an ABI, but with its implementation in userspace. vhost-net should offer equivalent performance to vbus. > A dominant vbus design principle is to try to achieve the same IO > performance for all "linux applications" whether they be literally > userspace applications, or things like KVM vcpus or Ira's physical > boards. It also aims to solve problems not previously expressible with > current technologies (even virtio), like nested real-time. > > And even though you repeatedly insist otherwise, the neat thing here is > that the two technologies mesh (at least under certain circumstances, > like when virtio is deployed on a shared-memory friendly linux backend > like KVM). I hope that my stack diagram below depicts that clearly. > Right, when you ignore the points where they don't fit, it's a perfect mesh. >> But that's not a strong argument for vbus; instead of adding vbus you >> could make virtio more friendly to non-virt >> > Actually, it _is_ a strong argument then because adding vbus is what > helps makes virtio friendly to non-virt, at least for when performance > matters. > As vhost-net shows, you can do that without vbus and without breaking compatibility. >> Right. 
virtio assumes that it's in a virt scenario and that the guest
>> architecture already has enumeration and hotplug mechanisms which it
>> would prefer to use. That happens to be the case for kvm/x86.
>
> No, virtio doesn't assume that. Its stack provides the "virtio-bus"
> abstraction, and what it does assume is that it will be wired up to
> something underneath. kvm/x86 conveniently has PCI, so the virtio-pci
> adapter was created to reuse much of that facility. For other things
> like lguest and s390, something new had to be created underneath to
> make up for the lack of PCI-like support.

Right, I was wrong there. But it does allow you to have a 1:1 mapping
between native devices and virtio devices.

>>> So to answer your question, the difference is that the part that has to
>>> be customized in vbus should be a fraction of what needs to be
>>> customized with vhost, because it defines more of the stack.
>>
>> But if you want to use the native mechanisms, vbus doesn't have any
>> added value.
>
> First of all, that's incorrect. If you want to use the "native"
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Fri, Sep 25, 2009 at 10:01:58AM -0700, Ira W. Snyder wrote:
> > +	case VHOST_SET_VRING_KICK:
> > +		r = copy_from_user(&f, argp, sizeof f);
> > +		if (r < 0)
> > +			break;
> > +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> > +		if (IS_ERR(eventfp))
> > +			return PTR_ERR(eventfp);
> > +		if (eventfp != vq->kick) {
> > +			pollstop = filep = vq->kick;
> > +			pollstart = vq->kick = eventfp;
> > +		} else
> > +			filep = eventfp;
> > +		break;
> > +	case VHOST_SET_VRING_CALL:
> > +		r = copy_from_user(&f, argp, sizeof f);
> > +		if (r < 0)
> > +			break;
> > +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> > +		if (IS_ERR(eventfp))
> > +			return PTR_ERR(eventfp);
> > +		if (eventfp != vq->call) {
> > +			filep = vq->call;
> > +			ctx = vq->call_ctx;
> > +			vq->call = eventfp;
> > +			vq->call_ctx = eventfp ?
> > +				eventfd_ctx_fileget(eventfp) : NULL;
> > +		} else
> > +			filep = eventfp;
> > +		break;
> > +	case VHOST_SET_VRING_ERR:
> > +		r = copy_from_user(&f, argp, sizeof f);
> > +		if (r < 0)
> > +			break;
> > +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> > +		if (IS_ERR(eventfp))
> > +			return PTR_ERR(eventfp);
> > +		if (eventfp != vq->error) {
> > +			filep = vq->error;
> > +			vq->error = eventfp;
> > +			ctx = vq->error_ctx;
> > +			vq->error_ctx = eventfp ?
> > +				eventfd_ctx_fileget(eventfp) : NULL;
> > +		} else
> > +			filep = eventfp;
> > +		break;
>
> I'm not sure how these eventfd's save a trip to userspace.
>
> AFAICT, eventfd's cannot be used to signal another part of the kernel,
> they can only be used to wake up userspace.

Yes, they can. See irqfd code in virt/kvm/eventfd.c.

> In my system, when an IRQ for kick() comes in, I have an eventfd which
> gets signalled to notify userspace. When I want to send a call(), I have
> to use a special ioctl(), just like lguest does.
>
> Doesn't this mean that for call(), vhost is just going to signal an
> eventfd to wake up userspace, which is then going to call ioctl(), and
> then we're back in kernelspace. Seems like a wasted userspace
> round-trip.
>
> Or am I mis-reading this code?

Yes. Kernel can poll eventfd and deliver an interrupt directly without
involving userspace.

> PS - you can see my current code at:
> http://www.mmarray.org/~iws/virtio-phys/
>
> Thanks,
> Ira
>
> > +	default:
> > +		r = -ENOIOCTLCMD;
> > +	}
> > +
> > +	if (pollstop && vq->handle_kick)
> > +		vhost_poll_stop(&vq->poll);
> > +
> > +	if (ctx)
> > +		eventfd_ctx_put(ctx);
> > +	if (filep)
> > +		fput(filep);
> > +
> > +	if (pollstart && vq->handle_kick)
> > +		vhost_poll_start(&vq->poll, vq->kick);
> > +
> > +	mutex_unlock(&vq->mutex);
> > +
> > +	if (pollstop && vq->handle_kick)
> > +		vhost_poll_flush(&vq->poll);
> > +	return 0;
> > +}
> > +
> > +long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl,
> > +		     unsigned long arg)
> > +{
> > +	void __user *argp = (void __user *)arg;
> > +	long r;
> > +
> > +	mutex_lock(&d->mutex);
> > +	/* If you are not the owner, you can become one */
> > +	if (ioctl == VHOST_SET_OWNER) {
> > +		r = vhost_dev_set_owner(d);
> > +		goto done;
> > +	}
> > +
> > +	/* You must be the owner to do anything else */
> > +	r = vhost_dev_check_owner(d);
> > +	if (r)
> > +		goto done;
> > +
> > +	switch (ioctl) {
> > +	case VHOST_SET_MEM_TABLE:
> > +		r = vhost_set_memory(d, argp);
> > +		break;
> > +	default:
> > +		r = vhost_set_vring(d, ioctl, argp);
> > +		break;
> > +	}
> > +done:
> > +	mutex_unlock(&d->mutex);
> > +	return r;
> > +}
> > +
> > +static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
> > +						     __u64 addr, __u32 len)
> > +{
> > +	struct vhost_memory_region *reg;
> > +	int i;
> > +	/* linear search is not brilliant, but we really have on the order of 6
> > +	 * regions in practice */
> > +	for (i = 0; i < mem->nregions; ++i) {
> > +		reg = mem->regions + i;
> > +		if (reg->guest_phys_addr <= addr &&
> > +		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
> > +			return reg;
> > +	}
> > +	return NULL;
> > +}
> > +
> > +int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
> > +		   struct iovec iov[], int iov_size)
> > +{
> > +	const s
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/24/2009 09:03 PM, Gregory Haskins wrote: >> >>> I don't really see how vhost and vbus are different here. vhost expects >>> signalling to happen through a couple of eventfds and requires someone >>> to supply them and implement kernel support (if needed). vbus requires >>> someone to write a connector to provide the signalling implementation. >>> Neither will work out-of-the-box when implementing virtio-net over >>> falling dominos, for example. >>> >> I realize in retrospect that my choice of words above implies vbus _is_ >> complete, but this is not what I was saying. What I was trying to >> convey is that vbus is _more_ complete. Yes, in either case some kind >> of glue needs to be written. The difference is that vbus implements >> more of the glue generally, and leaves less required to be customized >> for each iteration. >> > > > No argument there. Since you care about non-virt scenarios and virtio > doesn't, naturally vbus is a better fit for them as the code stands. Thanks for finally starting to acknowledge there's a benefit, at least. To be more precise, IMO virtio is designed to be a performance oriented ring-based driver interface that supports all types of hypervisors (e.g. shmem based kvm, and non-shmem based Xen). vbus is designed to be a high-performance generic shared-memory interconnect (for rings or otherwise) framework for environments where linux is the underpinning "host" (physical or virtual). They are distinctly different, but complementary (the former addresses the part of the front-end, and latter addresses the back-end, and a different part of the front-end). In addition, the kvm-connector used in AlacrityVM's design strives to add value and improve performance via other mechanisms, such as dynamic allocation, interrupt coalescing (thus reducing exit-ratio, which is a serious issue in KVM) and priortizable/nestable signals. 
Today there is a large performance disparity between what a KVM guest sees and what a native linux application sees on that same host. Just take a look at some of my graphs between "virtio", and "native", for example: http://developer.novell.com/wiki/images/b/b7/31-rc4_throughput.png A dominant vbus design principle is to try to achieve the same IO performance for all "linux applications" whether they be literally userspace applications, or things like KVM vcpus or Ira's physical boards. It also aims to solve problems not previously expressible with current technologies (even virtio), like nested real-time. And even though you repeatedly insist otherwise, the neat thing here is that the two technologies mesh (at least under certain circumstances, like when virtio is deployed on a shared-memory friendly linux backend like KVM). I hope that my stack diagram below depicts that clearly. > But that's not a strong argument for vbus; instead of adding vbus you > could make virtio more friendly to non-virt Actually, it _is_ a strong argument then because adding vbus is what helps makes virtio friendly to non-virt, at least for when performance matters. > (there's a limit how far you > can take this, not imposed by the code, but by virtio's charter as a > virtual device driver framework). > >> Going back to our stack diagrams, you could think of a vhost solution >> like this: >> >> -- >> | virtio-net >> -- >> | virtio-ring >> -- >> | virtio-bus >> -- >> | ? undefined-1 ? >> -- >> | vhost >> -- >> >> and you could think of a vbus solution like this >> >> -- >> | virtio-net >> -- >> | virtio-ring >> -- >> | virtio-bus >> -- >> | bus-interface >> -- >> | ? undefined-2 ? >> -- >> | bus-model >> -- >> | virtio-net-device (vhost ported to vbus model? 
:) >> -- >> >> >> So the difference between vhost and vbus in this particular context is >> that you need to have "undefined-1" do device discovery/hotswap, >> config-space, address-decode/isolation, signal-path routing, memory-path >> routing, etc. Today this function is filled by things like virtio-pci, >> pci-bus, KVM/ioeventfd, and QEMU for x86. I am not as familiar with >> lguest, but presumably it is filled there by components like >> virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher. And to use >> more contemporary examples, we might have virtio-domino, domino-bus, >> domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko, >> and ira-launcher. >> >> Contrast this to the vbus stack: The bus-X components (when optionally >> employed by the connector designer) do device-discovery, hotswap, >> config-space, address-decode/isolation, signal-path and memory-path >> routing, etc in a general (and pv-centric) way.
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote: > What it is: vhost net is a character device that can be used to reduce > the number of system calls involved in virtio networking. > Existing virtio net code is used in the guest without modification. > > There's similarity with vringfd, with some differences and reduced scope > - uses eventfd for signalling > - structures can be moved around in memory at any time (good for migration) > - supports a memory table and not just an offset (needed for kvm) > > common virtio related code has been put in a separate file vhost.c and > can be made into a separate module if/when more backends appear. I used > Rusty's lguest.c as the source for developing this part : this supplied > me with witty comments I wouldn't be able to write myself. > > What it is not: vhost net is not a bus, and not a generic new system > call. No assumptions are made on how the guest performs hypercalls. > Userspace hypervisors are supported as well as kvm. > > How it works: Basically, we connect a virtio frontend (configured by > userspace) to a backend. The backend could be a network device, or a > tun-like device. In this version I only support a raw socket as a backend, > which can be bound to e.g. SR-IOV, or to a macvlan device. The backend is > also configured by userspace, including vlan/mac etc. > > Status: > This works for me, and I haven't seen any crashes. > I have done some light benchmarking (with v4), compared to userspace, I > see improved latency (as I save up to 4 system calls per packet) but not > bandwidth/CPU (as TSO and interrupt mitigation are not supported). For the > ping benchmark (where there's no TSO) throughput is also improved. > > Features that I plan to look at in the future: > - tap support > - TSO > - interrupt mitigation > - zero copy > > Acked-by: Arnd Bergmann > Signed-off-by: Michael S. 
Tsirkin > > --- > MAINTAINERS| 10 + > arch/x86/kvm/Kconfig |1 + > drivers/Makefile |1 + > drivers/vhost/Kconfig | 11 + > drivers/vhost/Makefile |2 + > drivers/vhost/net.c| 475 ++ > drivers/vhost/vhost.c | 688 > > drivers/vhost/vhost.h | 122 > include/linux/Kbuild |1 + > include/linux/miscdevice.h |1 + > include/linux/vhost.h | 101 +++ > 11 files changed, 1413 insertions(+), 0 deletions(-) > create mode 100644 drivers/vhost/Kconfig > create mode 100644 drivers/vhost/Makefile > create mode 100644 drivers/vhost/net.c > create mode 100644 drivers/vhost/vhost.c > create mode 100644 drivers/vhost/vhost.h > create mode 100644 include/linux/vhost.h > > diff --git a/MAINTAINERS b/MAINTAINERS > index b1114cf..de4587f 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -5431,6 +5431,16 @@ S: Maintained > F: Documentation/filesystems/vfat.txt > F: fs/fat/ > > +VIRTIO HOST (VHOST) > +P: Michael S. Tsirkin > +M: m...@redhat.com > +L: k...@vger.kernel.org > +L: virtualizat...@lists.osdl.org > +L: net...@vger.kernel.org > +S: Maintained > +F: drivers/vhost/ > +F: include/linux/vhost.h > + > VIA RHINE NETWORK DRIVER > M: Roger Luethi > S: Maintained > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index b84e571..94f44d9 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -64,6 +64,7 @@ config KVM_AMD > > # OK, it's a little counter-intuitive to do this, but it puts it neatly under > # the virtualization menu. 
> +source drivers/vhost/Kconfig > source drivers/lguest/Kconfig > source drivers/virtio/Kconfig > > diff --git a/drivers/Makefile b/drivers/Makefile > index bc4205d..1551ae1 100644 > --- a/drivers/Makefile > +++ b/drivers/Makefile > @@ -105,6 +105,7 @@ obj-$(CONFIG_HID) += hid/ > obj-$(CONFIG_PPC_PS3)+= ps3/ > obj-$(CONFIG_OF) += of/ > obj-$(CONFIG_SSB)+= ssb/ > +obj-$(CONFIG_VHOST_NET) += vhost/ > obj-$(CONFIG_VIRTIO) += virtio/ > obj-$(CONFIG_VLYNQ) += vlynq/ > obj-$(CONFIG_STAGING)+= staging/ > diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig > new file mode 100644 > index 000..d955406 > --- /dev/null > +++ b/drivers/vhost/Kconfig > @@ -0,0 +1,11 @@ > +config VHOST_NET > + tristate "Host kernel accelerator for virtio net" > + depends on NET && EVENTFD > + ---help--- > + This kernel module can be loaded in host kernel to accelerate > + guest networking with virtio_net. Not to be confused with virtio_net > + module itself which needs to be loaded in guest kernel. > + > + To compile this driver as a module, choose M here: the module will > + be called vhost_net. > + > diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile > new file mode 100644 > index 000..72dd020 > --- /dev/null > +++ b/drivers/vhost/Makefile > @@ -0,0 +1,2 @@ > +obj-$(C
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/24/2009 09:03 PM, Gregory Haskins wrote: > >> I don't really see how vhost and vbus are different here. vhost expects >> signalling to happen through a couple of eventfds and requires someone >> to supply them and implement kernel support (if needed). vbus requires >> someone to write a connector to provide the signalling implementation. >> Neither will work out-of-the-box when implementing virtio-net over >> falling dominos, for example. >> > I realize in retrospect that my choice of words above implies vbus _is_ > complete, but this is not what I was saying. What I was trying to > convey is that vbus is _more_ complete. Yes, in either case some kind > of glue needs to be written. The difference is that vbus implements > more of the glue generally, and leaves less required to be customized > for each iteration. > No argument there. Since you care about non-virt scenarios and virtio doesn't, naturally vbus is a better fit for them as the code stands. But that's not a strong argument for vbus; instead of adding vbus you could make virtio more friendly to non-virt (there's a limit how far you can take this, not imposed by the code, but by virtio's charter as a virtual device driver framework). > Going back to our stack diagrams, you could think of a vhost solution > like this: > > -- > | virtio-net > -- > | virtio-ring > -- > | virtio-bus > -- > | ? undefined-1 ? > -- > | vhost > -- > > and you could think of a vbus solution like this > > -- > | virtio-net > -- > | virtio-ring > -- > | virtio-bus > -- > | bus-interface > -- > | ? undefined-2 ? > -- > | bus-model > -- > | virtio-net-device (vhost ported to vbus model? :) > -- > > > So the difference between vhost and vbus in this particular context is > that you need to have "undefined-1" do device discovery/hotswap, > config-space, address-decode/isolation, signal-path routing, memory-path > routing, etc. Today this function is filled by things like virtio-pci, > pci-bus, KVM/ioeventfd, and QEMU for x86. 
I am not as familiar with > lguest, but presumably it is filled there by components like > virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher. And to use > more contemporary examples, we might have virtio-domino, domino-bus, > domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko, > and ira-launcher. > > Contrast this to the vbus stack: The bus-X components (when optionally > employed by the connector designer) do device-discovery, hotswap, > config-space, address-decode/isolation, signal-path and memory-path > routing, etc in a general (and pv-centric) way. The "undefined-2" > portion is the "connector", and just needs to convey messages like > "DEVCALL" and "SHMSIGNAL". The rest is handled in other parts of the stack. > > Right. virtio assumes that it's in a virt scenario and that the guest architecture already has enumeration and hotplug mechanisms which it would prefer to use. That happens to be the case for kvm/x86. > So to answer your question, the difference is that the part that has to > be customized in vbus should be a fraction of what needs to be > customized with vhost because it defines more of the stack. But if you want to use the native mechanisms, vbus doesn't have any added value. > And, as > alluded to in my diagram, both virtio-net and vhost (with some > modifications to fit into the vbus framework) are potentially > complementary, not competitors. > Only theoretically. The existing installed base would have to be thrown away, or we'd need to support both. >> Without a vbus-connector-falling-dominos, vbus-venet can't do anything >> either. >> > Mostly covered above... > > However, I was addressing your assertion that vhost somehow magically > accomplishes this "container/addressing" function without any specific > kernel support. This is incorrect. I contend that this kernel support > is required and present. The difference is that it's defined elsewhere > (and typically in a transport/arch specific way). 
> > IOW: You can basically think of the programmed PIO addresses as forming > its "container". Only addresses explicitly added are visible, and > everything else is inaccessible. This whole discussion is merely a > question of what's been generalized verses what needs to be > re-implemented each time. > Sorry, this is too abstract for me. >> vbus doesn't do kvm guest address decoding for the fast path. It's >> still done by ioeventfd. >> > That is not correct. vbus does its own native address decoding in the > fast path, such as here: > > http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/l
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/24/2009 10:27 PM, Ira W. Snyder wrote: >>> Ira can make ira-bus, and ira-eventfd, etc, etc. >>> >>> Each iteration will invariably introduce duplicated parts of the stack. >>> >>> >> Invariably? Use libraries (virtio-shmem.ko, libvhost.so). >> >> > Referencing libraries that don't yet exist doesn't seem like a good > argument against vbus from my point of view. I'm not specifically > advocating for vbus; I'm just letting you know how it looks to another > developer in the trenches. > My argument is that we shouldn't write a new framework instead of fixing or extending an existing one. > If you'd like to see the amount of duplication present, look at the code > I'm currently working on. Yes, virtio-phys-guest looks pretty much duplicated. Looks like it should be pretty easy to deduplicate. > It mostly works at this point, though I > haven't finished my userspace, nor figured out how to actually transfer > data. > > The current question I have (just to let you know where I am in > development) is: > > I have the physical address of the remote data, but how do I get it into > a userspace buffer, so I can pass it to tun? > vhost does guest physical address to host userspace address translation (in your scenario, remote physical to local virtual) using a table of memory slots; there's an ioctl that allows userspace to initialize that table. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Thu, Sep 24, 2009 at 10:18:28AM +0300, Avi Kivity wrote: > On 09/24/2009 12:15 AM, Gregory Haskins wrote: > > > >>> There are various aspects about designing high-performance virtual > >>> devices such as providing the shortest paths possible between the > >>> physical resources and the consumers. Conversely, we also need to > >>> ensure that we meet proper isolation/protection guarantees at the same > >>> time. What this means is there are various aspects to any > >>> high-performance PV design that require to be placed in-kernel to > >>> maximize the performance yet properly isolate the guest. > >>> > >>> For instance, you are required to have your signal-path (interrupts and > >>> hypercalls), your memory-path (gpa translation), and > >>> addressing/isolation model in-kernel to maximize performance. > >>> > >>> > >> Exactly. That's what vhost puts into the kernel and nothing more. > >> > > Actually, no. Generally, _KVM_ puts those things into the kernel, and > > vhost consumes them. Without KVM (or something equivalent), vhost is > > incomplete. One of my goals with vbus is to generalize the "something > > equivalent" part here. > > > > I don't really see how vhost and vbus are different here. vhost expects > signalling to happen through a couple of eventfds and requires someone > to supply them and implement kernel support (if needed). vbus requires > someone to write a connector to provide the signalling implementation. > Neither will work out-of-the-box when implementing virtio-net over > falling dominos, for example. > > >>> Vbus accomplishes its in-kernel isolation model by providing a > >>> "container" concept, where objects are placed into this container by > >>> userspace. The host kernel enforces isolation/protection by using a > >>> namespace to identify objects that is only relevant within a specific > >>> container's context (namely, a "u32 dev-id"). 
The guest addresses the > >>> objects by its dev-id, and the kernel ensures that the guest can't > >>> access objects outside of its dev-id namespace. > >>> > >>> > >> vhost manages to accomplish this without any kernel support. > >> > > No, vhost manages to accomplish this because of KVM's kernel support > > (ioeventfd, etc). Without KVM-like in-kernel support, vhost is > > merely a kind of "tuntap"-like clone signalled by eventfds. > > > > Without a vbus-connector-falling-dominos, vbus-venet can't do anything > either. Both vhost and vbus need an interface, vhost's is just narrower > since it doesn't do configuration or enumeration. > > > This goes directly to my rebuttal of your claim that vbus places too > much in the kernel. I state that, one way or the other, address decode > and isolation _must_ be in the kernel for performance. Vbus does this > with a devid/container scheme. vhost+virtio-pci+kvm does it with > pci+pio+ioeventfd. > > > > vbus doesn't do kvm guest address decoding for the fast path. It's > still done by ioeventfd. > > >> The guest > >> simply has no access to any vhost resources other than the guest->host > >> doorbell, which is handed to the guest outside vhost (so it's somebody > >> else's problem, in userspace). > >> > > You mean _controlled_ by userspace, right? Obviously, the other side of > > the kernel still needs to be programmed (ioeventfd, etc). Otherwise, > > vhost would be pointless: e.g. just use vanilla tuntap if you don't need > > fast in-kernel decoding. > > > > Yes (though for something like level-triggered interrupts we're probably > keeping it in userspace, enjoying the benefits of vhost data path while > paying more for signalling). > > >>> All that is required is a way to transport a message with a "devid" > >>> attribute as an address (such as DEVCALL(devid)) and the framework > >>> provides the rest of the decode+execute function. > >>> > >>> > >> vhost avoids that. > >> > > No, it doesn't avoid it. 
It just doesn't specify how it's done, and > > relies on something else to do it on its behalf. > > > > That someone else can be in userspace, apart from the actual fast path. > > > Conversely, vbus specifies how it's done, but not how to transport the > > verb "across the wire". That is the role of the vbus-connector abstraction. > > > > So again, vbus does everything in the kernel (since it's so easy and > cheap) but expects a vbus-connector. vhost does configuration in > userspace (since it's so clunky and fragile) but expects a couple of > eventfds. > > >>> Contrast this to vhost+virtio-pci (called simply "vhost" from here). > >>> > >>> > >> It's the wrong name. vhost implements only the data path. > >> > > Understood, but vhost+virtio-pci is what I am contrasting, and I use > > "vhost" for short from that point on because I am too lazy to type the > > whole name over and over ;) > > > > If you #define A A+B+C don't expect intelligent conversation afterwards.
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/23/2009 10:37 PM, Avi Kivity wrote: >> >> Example: feature negotiation. If it happens in userspace, it's easy >> to limit what features we expose to the guest. If it happens in the >> kernel, we need to add an interface to let the kernel know which >> features it should expose to the guest. We also need to add an >> interface to let userspace know which features were negotiated, if we >> want to implement live migration. Something fairly trivial bloats >> rapidly. > > btw, we have this issue with kvm reporting cpuid bits to the guest. > Instead of letting kvm talk directly to the hardware and the guest, kvm > gets the cpuid bits from the hardware, strips away features it doesn't > support, exposes that to userspace, and expects userspace to program the > cpuid bits it wants to expose to the guest (which may be different than > what kvm exposed to userspace, and different from guest to guest). > This issue doesn't exist in the model I am referring to, as these are all virtual-devices anyway. See my last reply -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/24/2009 12:15 AM, Gregory Haskins wrote: >> There are various aspects about designing high-performance virtual devices such as providing the shortest paths possible between the physical resources and the consumers. Conversely, we also need to ensure that we meet proper isolation/protection guarantees at the same time. What this means is there are various aspects to any high-performance PV design that require to be placed in-kernel to maximize the performance yet properly isolate the guest. For instance, you are required to have your signal-path (interrupts and hypercalls), your memory-path (gpa translation), and addressing/isolation model in-kernel to maximize performance. >>> Exactly. That's what vhost puts into the kernel and nothing more. >>> >> Actually, no. Generally, _KVM_ puts those things into the kernel, and >> vhost consumes them. Without KVM (or something equivalent), vhost is >> incomplete. One of my goals with vbus is to generalize the "something >> equivalent" part here. >> > > I don't really see how vhost and vbus are different here. vhost expects > signalling to happen through a couple of eventfds and requires someone > to supply them and implement kernel support (if needed). vbus requires > someone to write a connector to provide the signalling implementation. > Neither will work out-of-the-box when implementing virtio-net over > falling dominos, for example. I realize in retrospect that my choice of words above implies vbus _is_ complete, but this is not what I was saying. What I was trying to convey is that vbus is _more_ complete. Yes, in either case some kind of glue needs to be written. The difference is that vbus implements more of the glue generally, and leaves less required to be customized for each iteration. Going back to our stack diagrams, you could think of a vhost solution like this: -- | virtio-net -- | virtio-ring -- | virtio-bus -- | ? undefined-1 ? 
-- | vhost -- and you could think of a vbus solution like this -- | virtio-net -- | virtio-ring -- | virtio-bus -- | bus-interface -- | ? undefined-2 ? -- | bus-model -- | virtio-net-device (vhost ported to vbus model? :) -- So the difference between vhost and vbus in this particular context is that you need to have "undefined-1" do device discovery/hotswap, config-space, address-decode/isolation, signal-path routing, memory-path routing, etc. Today this function is filled by things like virtio-pci, pci-bus, KVM/ioeventfd, and QEMU for x86. I am not as familiar with lguest, but presumably it is filled there by components like virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher. And to use more contemporary examples, we might have virtio-domino, domino-bus, domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko, and ira-launcher. Contrast this to the vbus stack: The bus-X components (when optionally employed by the connector designer) do device-discovery, hotswap, config-space, address-decode/isolation, signal-path and memory-path routing, etc in a general (and pv-centric) way. The "undefined-2" portion is the "connector", and just needs to convey messages like "DEVCALL" and "SHMSIGNAL". The rest is handled in other parts of the stack. So to answer your question, the difference is that the part that has to be customized in vbus should be a fraction of what needs to be customized with vhost because it defines more of the stack. And, as alluded to in my diagram, both virtio-net and vhost (with some modifications to fit into the vbus framework) are potentially complementary, not competitors. > Vbus accomplishes its in-kernel isolation model by providing a "container" concept, where objects are placed into this container by userspace. The host kernel enforces isolation/protection by using a namespace to identify objects that is only relevant within a specific container's context (namely, a "u32 dev-id"). 
The guest addresses the objects by its dev-id, and the kernel ensures that the guest can't access objects outside of its dev-id namespace. >>> vhost manages to accomplish this without any kernel support. >>> >> No, vhost manages to accomplish this because of KVM's kernel support >> (ioeventfd, etc). Without KVM-like in-kernel support, vhost is >> merely a kind of "tuntap"-like clone signalled by eventfds. >> > > Without a vbus-connector-falling-dominos, vbus-venet can't do anything > either. Mostly covered above... However, I was addressing your assertion that vhost somehow magically accomplishes this "container/addressing" function without any specific kernel support.
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/23/2009 10:37 PM, Avi Kivity wrote: > > Example: feature negotiation. If it happens in userspace, it's easy > to limit what features we expose to the guest. If it happens in the > kernel, we need to add an interface to let the kernel know which > features it should expose to the guest. We also need to add an > interface to let userspace know which features were negotiated, if we > want to implement live migration. Something fairly trivial bloats > rapidly. btw, we have this issue with kvm reporting cpuid bits to the guest. Instead of letting kvm talk directly to the hardware and the guest, kvm gets the cpuid bits from the hardware, strips away features it doesn't support, exposes that to userspace, and expects userspace to program the cpuid bits it wants to expose to the guest (which may be different than what kvm exposed to userspace, and different from guest to guest).
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/24/2009 12:15 AM, Gregory Haskins wrote: > >>> There are various aspects about designing high-performance virtual >>> devices such as providing the shortest paths possible between the >>> physical resources and the consumers. Conversely, we also need to >>> ensure that we meet proper isolation/protection guarantees at the same >>> time. What this means is there are various aspects to any >>> high-performance PV design that require to be placed in-kernel to >>> maximize the performance yet properly isolate the guest. >>> >>> For instance, you are required to have your signal-path (interrupts and >>> hypercalls), your memory-path (gpa translation), and >>> addressing/isolation model in-kernel to maximize performance. >>> >>> >> Exactly. That's what vhost puts into the kernel and nothing more. >> > Actually, no. Generally, _KVM_ puts those things into the kernel, and > vhost consumes them. Without KVM (or something equivalent), vhost is > incomplete. One of my goals with vbus is to generalize the "something > equivalent" part here. > I don't really see how vhost and vbus are different here. vhost expects signalling to happen through a couple of eventfds and requires someone to supply them and implement kernel support (if needed). vbus requires someone to write a connector to provide the signalling implementation. Neither will work out-of-the-box when implementing virtio-net over falling dominos, for example. >>> Vbus accomplishes its in-kernel isolation model by providing a >>> "container" concept, where objects are placed into this container by >>> userspace. The host kernel enforces isolation/protection by using a >>> namespace to identify objects that is only relevant within a specific >>> container's context (namely, a "u32 dev-id"). The guest addresses the >>> objects by its dev-id, and the kernel ensures that the guest can't >>> access objects outside of its dev-id namespace. >>> >>> >> vhost manages to accomplish this without any kernel support. 
>> > No, vhost manages to accomplish this because of KVM's kernel support > (ioeventfd, etc). Without KVM-like in-kernel support, vhost is > merely a kind of "tuntap"-like clone signalled by eventfds. > Without a vbus-connector-falling-dominos, vbus-venet can't do anything either. Both vhost and vbus need an interface, vhost's is just narrower since it doesn't do configuration or enumeration. > This goes directly to my rebuttal of your claim that vbus places too > much in the kernel. I state that, one way or the other, address decode > and isolation _must_ be in the kernel for performance. Vbus does this > with a devid/container scheme. vhost+virtio-pci+kvm does it with > pci+pio+ioeventfd. > vbus doesn't do kvm guest address decoding for the fast path. It's still done by ioeventfd. >> The guest >> simply has no access to any vhost resources other than the guest->host >> doorbell, which is handed to the guest outside vhost (so it's somebody >> else's problem, in userspace). >> > You mean _controlled_ by userspace, right? Obviously, the other side of > the kernel still needs to be programmed (ioeventfd, etc). Otherwise, > vhost would be pointless: e.g. just use vanilla tuntap if you don't need > fast in-kernel decoding. > Yes (though for something like level-triggered interrupts we're probably keeping it in userspace, enjoying the benefits of vhost data path while paying more for signalling). >>> All that is required is a way to transport a message with a "devid" >>> attribute as an address (such as DEVCALL(devid)) and the framework >>> provides the rest of the decode+execute function. >>> >>> >> vhost avoids that. >> > No, it doesn't avoid it. It just doesn't specify how it's done, and > relies on something else to do it on its behalf. > That someone else can be in userspace, apart from the actual fast path. > Conversely, vbus specifies how it's done, but not how to transport the > verb "across the wire". That is the role of the vbus-connector abstraction. 
> So again, vbus does everything in the kernel (since it's so easy and cheap) but expects a vbus-connector. vhost does configuration in userspace (since it's so clunky and fragile) but expects a couple of eventfds. >>> Contrast this to vhost+virtio-pci (called simply "vhost" from here). >>> >>> >> It's the wrong name. vhost implements only the data path. >> > Understood, but vhost+virtio-pci is what I am contrasting, and I use > "vhost" for short from that point on because I am too lazy to type the > whole name over and over ;) > If you #define A A+B+C don't expect intelligent conversation afterwards. >>> It is not immune to requiring in-kernel addressing support either, but >>> rather it just does it differently (and it's not as you might expect via >>> qemu). >>> >>> Vhost relies on QEMU to render PCI objects to the guest, which the guest >>> assigns resources (such as BARs, interrupts, etc).
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/23/2009 08:58 PM, Gregory Haskins wrote: >>> It also pulls parts of the device model into the host kernel. >>> That is the point. Most of it needs to be there for performance. >>> >> To clarify this point: >> >> There are various aspects about designing high-performance virtual >> devices such as providing the shortest paths possible between the >> physical resources and the consumers. Conversely, we also need to >> ensure that we meet proper isolation/protection guarantees at the same >> time. What this means is there are various aspects to any >> high-performance PV design that require to be placed in-kernel to >> maximize the performance yet properly isolate the guest. >> >> For instance, you are required to have your signal-path (interrupts and >> hypercalls), your memory-path (gpa translation), and >> addressing/isolation model in-kernel to maximize performance. >> > > Exactly. That's what vhost puts into the kernel and nothing more. Actually, no. Generally, _KVM_ puts those things into the kernel, and vhost consumes them. Without KVM (or something equivalent), vhost is incomplete. One of my goals with vbus is to generalize the "something equivalent" part here. I know you may not care about non-kvm use cases, and that's fine. No one says you have to. However, note that some of us do care about these non-kvm cases, and thus it's a distinction I am making here as a benefit of the vbus framework. > >> Vbus accomplishes its in-kernel isolation model by providing a >> "container" concept, where objects are placed into this container by >> userspace. The host kernel enforces isolation/protection by using a >> namespace to identify objects that is only relevant within a specific >> container's context (namely, a "u32 dev-id"). The guest addresses the >> objects by its dev-id, and the kernel ensures that the guest can't >> access objects outside of its dev-id namespace. >> > > vhost manages to accomplish this without any kernel support. 
No, vhost manages to accomplish this because of KVM's kernel support (ioeventfd, etc). Without KVM-like in-kernel support, vhost is merely a kind of "tuntap"-like clone signalled by eventfds. vbus, on the other hand, generalizes one more piece of the puzzle (namely, the function of pio+ioeventfd and userspace's programming of it) by presenting the devid namespace and container concept. This goes directly to my rebuttal of your claim that vbus places too much in the kernel. I state that, one way or the other, address decode and isolation _must_ be in the kernel for performance. Vbus does this with a devid/container scheme. vhost+virtio-pci+kvm does it with pci+pio+ioeventfd. > The guest > simply has no access to any vhost resources other than the guest->host > doorbell, which is handed to the guest outside vhost (so it's somebody > else's problem, in userspace). You mean _controlled_ by userspace, right? Obviously, the other side of the kernel still needs to be programmed (ioeventfd, etc). Otherwise, vhost would be pointless: e.g. just use vanilla tuntap if you don't need fast in-kernel decoding. > >> All that is required is a way to transport a message with a "devid" >> attribute as an address (such as DEVCALL(devid)) and the framework >> provides the rest of the decode+execute function. >> > > vhost avoids that. No, it doesn't avoid it. It just doesn't specify how it's done, and relies on something else to do it on its behalf. Conversely, vbus specifies how it's done, but not how to transport the verb "across the wire". That is the role of the vbus-connector abstraction. > >> Contrast this to vhost+virtio-pci (called simply "vhost" from here). >> > > It's the wrong name. vhost implements only the data path. 
Understood, but vhost+virtio-pci is what I am contrasting, and I use "vhost" for short from that point on because I am too lazy to type the whole name over and over ;) > >> It is not immune to requiring in-kernel addressing support either, but >> rather it just does it differently (and its not as you might expect via >> qemu). >> >> Vhost relies on QEMU to render PCI objects to the guest, which the guest >> assigns resources (such as BARs, interrupts, etc). > > vhost does not rely on qemu. It relies on its user to handle > configuration. In one important case it's qemu+pci. It could just as > well be the lguest launcher. I meant vhost=vhost+virtio-pci here. Sorry for the confusion. The point I am making specifically is that vhost in general relies on other in-kernel components to function. I.e., it cannot function without having something like the PCI model to build an IO namespace. That namespace (in this case, pio addresses+data tuples) is used for the in-kernel addressing function under KVM + virtio-pci. The case of the lguest launcher is a good one to highlight. Yes, you can presumably also use lguest with vhost, if the requisite facilities are expo
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/23/2009 08:58 PM, Gregory Haskins wrote: >> >>> It also pulls parts of the device model into the host kernel. >>> >> That is the point. Most of it needs to be there for performance. >> > To clarify this point: > > There are various aspects about designing high-performance virtual > devices such as providing the shortest paths possible between the > physical resources and the consumers. Conversely, we also need to > ensure that we meet proper isolation/protection guarantees at the same > time. What this means is there are various aspects to any > high-performance PV design that require to be placed in-kernel to > maximize the performance yet properly isolate the guest. > > For instance, you are required to have your signal-path (interrupts and > hypercalls), your memory-path (gpa translation), and > addressing/isolation model in-kernel to maximize performance. > Exactly. That's what vhost puts into the kernel and nothing more. > Vbus accomplishes its in-kernel isolation model by providing a > "container" concept, where objects are placed into this container by > userspace. The host kernel enforces isolation/protection by using a > namespace to identify objects that is only relevant within a specific > container's context (namely, a "u32 dev-id"). The guest addresses the > objects by its dev-id, and the kernel ensures that the guest can't > access objects outside of its dev-id namespace. > vhost manages to accomplish this without any kernel support. The guest simply has not access to any vhost resources other than the guest->host doorbell, which is handed to the guest outside vhost (so it's somebody else's problem, in userspace). > All that is required is a way to transport a message with a "devid" > attribute as an address (such as DEVCALL(devid)) and the framework > provides the rest of the decode+execute function. > vhost avoids that. > Contrast this to vhost+virtio-pci (called simply "vhost" from here). > It's the wrong name. 
vhost implements only the data path. > It is not immune to requiring in-kernel addressing support either, but > rather it just does it differently (and its not as you might expect via > qemu). > > Vhost relies on QEMU to render PCI objects to the guest, which the guest > assigns resources (such as BARs, interrupts, etc). vhost does not rely on qemu. It relies on its user to handle configuration. In one important case it's qemu+pci. It could just as well be the lguest launcher. >A PCI-BAR in this > example may represent a PIO address for triggering some operation in the > device-model's fast-path. For it to have meaning in the fast-path, KVM > has to have in-kernel knowledge of what a PIO-exit is, and what to do > with it (this is where pio-bus and ioeventfd come in). The programming > of the PIO-exit and the ioeventfd are likewise controlled by some > userspace management entity (i.e. qemu). The PIO address and value > tuple form the address, and the ioeventfd framework within KVM provide > the decode+execute function. > Right. > This idea seemingly works fine, mind you, but it rides on top of a *lot* > of stuff including but not limited to: the guests pci stack, the qemu > pci emulation, kvm pio support, and ioeventfd. When you get into > situations where you don't have PCI or even KVM underneath you (e.g. a > userspace container, Ira's rig, etc) trying to recreate all of that PCI > infrastructure for the sake of using PCI is, IMO, a lot of overhead for > little gain. > For the N+1th time, no. vhost is perfectly usable without pci. Can we stop raising and debunking this point? > All you really need is a simple decode+execute mechanism, and a way to > program it from userspace control. vbus tries to do just that: > commoditize it so all you need is the transport of the control messages > (like DEVCALL()), but the decode+execute itself is reuseable, even > across various environments (like KVM or Iras rig). 
> If you think it should be "commodotized", write libvhostconfig.so. > And your argument, I believe, is that vbus allows both to be implemented > in the kernel (though to reiterate, its optional) and is therefore a bad > design, so lets discuss that. > > I believe the assertion is that things like config-space are best left > to userspace, and we should only relegate fast-path duties to the > kernel. The problem is that, in my experience, a good deal of > config-space actually influences the fast-path and thus needs to > interact with the fast-path mechanism eventually anyway. > Whats left > over that doesn't fall into this category may cheaply ride on existing > plumbing, so its not like we created something new or unnatural just to > support this subclass of config-space. > Flexibility is reduced, because changing code in the kernel is more expensive than in userspace, and kernel/user interfaces aren't typically as wide as pure userspace interfaces. Security is reduced, since a
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Gregory Haskins wrote: > Avi Kivity wrote: >> On 09/23/2009 05:26 PM, Gregory Haskins wrote: >>> > Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, > and > virtio-s390. It isn't especially easy. I can steal lots of code from > the > lguest bus model, but sometimes it is good to generalize, especially > after the fourth implemention or so. I think this is what GHaskins > tried > to do. > > Yes. vbus is more finely layered so there is less code duplication. >>> To clarify, Ira was correct in stating this generalizing some of these >>> components was one of the goals for the vbus project: IOW vbus finely >>> layers and defines what's below virtio, not replaces it. >>> >>> You can think of a virtio-stack like this: >>> >>> -- >>> | virtio-net >>> -- >>> | virtio-ring >>> -- >>> | virtio-bus >>> -- >>> | ? undefined ? >>> -- >>> >>> IOW: The way I see it, virtio is a device interface model only. The >>> rest of it is filled in by the virtio-transport and some kind of >>> back-end. >>> >>> So today, we can complete the "? undefined ?" block like this for KVM: >>> >>> -- >>> | virtio-pci >>> -- >>> | >>> -- >>> | kvm.ko >>> -- >>> | qemu >>> -- >>> | tuntap >>> -- >>> >>> In this case, kvm.ko and tuntap are providing plumbing, and qemu is >>> providing a backend device model (pci-based, etc). >>> >>> You can, of course, plug a different stack in (such as virtio-lguest, >>> virtio-ira, etc) but you are more or less on your own to recreate many >>> of the various facilities contained in that stack (such as things >>> provided by QEMU, like discovery/hotswap/addressing), as Ira is >>> discovering. >>> >>> Vbus tries to commoditize more components in the stack (like the bus >>> model and backend-device model) so they don't need to be redesigned each >>> time we solve this "virtio-transport" problem. IOW: stop the >>> proliferation of the need for pci-bus, lguest-bus, foo-bus underneath >>> virtio. 
Instead, we can then focus on the value add on top, like the >>> models themselves or the simple glue between them. >>> >>> So now you might have something like >>> >>> -- >>> | virtio-vbus >>> -- >>> | vbus-proxy >>> -- >>> | kvm-guest-connector >>> -- >>> | >>> -- >>> | kvm.ko >>> -- >>> | kvm-host-connector.ko >>> -- >>> | vbus.ko >>> -- >>> | virtio-net-backend.ko >>> -- >>> >>> so now we don't need to worry about the bus-model or the device-model >>> framework. We only need to implement the connector, etc. This is handy >>> when you find yourself in an environment that doesn't support PCI (such >>> as Ira's rig, or userspace containers), or when you want to add features >>> that PCI doesn't have (such as fluid event channels for things like IPC >>> services, or priortizable interrupts, etc). >>> >> Well, vbus does more, for example it tunnels interrupts instead of >> exposing them 1:1 on the native interface if it exists. > > As I've previously explained, that trait is a function of the > kvm-connector I've chosen to implement, not of the overall design of vbus. > > The reason why my kvm-connector is designed that way is because my early > testing/benchmarking shows one of the issues in KVM performance is the > ratio of exits per IO operation are fairly high, especially as your > scale io-load. Therefore, the connector achieves a substantial > reduction in that ratio by treating "interrupts" to the same kind of > benefits that NAPI brought to general networking: That is, we enqueue > "interrupt" messages into a lockless ring and only hit the IDT for the > first occurrence. Subsequent interrupts are injected in a > parallel/lockless manner, without hitting the IDT nor incurring an extra > EOI. This pays dividends as the IO rate increases, which is when the > guest needs the most help. > > OTOH, it is entirely possible to design the connector such that we > maintain a 1:1 ratio of signals to traditional IDT interrupts. 
It is > also possible to design a connector which surfaces as something else, > such as PCI devices (by terminating the connector in QEMU and utilizing > its PCI emulation facilities), which would naturally employ 1:1 mapping. > > So if 1:1 mapping is a critical feature (I would argue to the contrary), > vbus can support it. > >> It also pulls parts of the device model into the host kernel. > > That is the point. Most of it needs to be there for performance. To clarify this point: There are various a
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/23/2009 05:26 PM, Gregory Haskins wrote: >> >> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and virtio-s390. It isn't especially easy. I can steal lots of code from the lguest bus model, but sometimes it is good to generalize, especially after the fourth implemention or so. I think this is what GHaskins tried to do. >>> Yes. vbus is more finely layered so there is less code duplication. >>> >> To clarify, Ira was correct in stating this generalizing some of these >> components was one of the goals for the vbus project: IOW vbus finely >> layers and defines what's below virtio, not replaces it. >> >> You can think of a virtio-stack like this: >> >> -- >> | virtio-net >> -- >> | virtio-ring >> -- >> | virtio-bus >> -- >> | ? undefined ? >> -- >> >> IOW: The way I see it, virtio is a device interface model only. The >> rest of it is filled in by the virtio-transport and some kind of >> back-end. >> >> So today, we can complete the "? undefined ?" block like this for KVM: >> >> -- >> | virtio-pci >> -- >> | >> -- >> | kvm.ko >> -- >> | qemu >> -- >> | tuntap >> -- >> >> In this case, kvm.ko and tuntap are providing plumbing, and qemu is >> providing a backend device model (pci-based, etc). >> >> You can, of course, plug a different stack in (such as virtio-lguest, >> virtio-ira, etc) but you are more or less on your own to recreate many >> of the various facilities contained in that stack (such as things >> provided by QEMU, like discovery/hotswap/addressing), as Ira is >> discovering. >> >> Vbus tries to commoditize more components in the stack (like the bus >> model and backend-device model) so they don't need to be redesigned each >> time we solve this "virtio-transport" problem. IOW: stop the >> proliferation of the need for pci-bus, lguest-bus, foo-bus underneath >> virtio. Instead, we can then focus on the value add on top, like the >> models themselves or the simple glue between them. 
>> >> So now you might have something like >> >> -- >> | virtio-vbus >> -- >> | vbus-proxy >> -- >> | kvm-guest-connector >> -- >> | >> -- >> | kvm.ko >> -- >> | kvm-host-connector.ko >> -- >> | vbus.ko >> -- >> | virtio-net-backend.ko >> -- >> >> so now we don't need to worry about the bus-model or the device-model >> framework. We only need to implement the connector, etc. This is handy >> when you find yourself in an environment that doesn't support PCI (such >> as Ira's rig, or userspace containers), or when you want to add features >> that PCI doesn't have (such as fluid event channels for things like IPC >> services, or priortizable interrupts, etc). >> > > Well, vbus does more, for example it tunnels interrupts instead of > exposing them 1:1 on the native interface if it exists. As I've previously explained, that trait is a function of the kvm-connector I've chosen to implement, not of the overall design of vbus. The reason why my kvm-connector is designed that way is because my early testing/benchmarking shows one of the issues in KVM performance is the ratio of exits per IO operation are fairly high, especially as your scale io-load. Therefore, the connector achieves a substantial reduction in that ratio by treating "interrupts" to the same kind of benefits that NAPI brought to general networking: That is, we enqueue "interrupt" messages into a lockless ring and only hit the IDT for the first occurrence. Subsequent interrupts are injected in a parallel/lockless manner, without hitting the IDT nor incurring an extra EOI. This pays dividends as the IO rate increases, which is when the guest needs the most help. OTOH, it is entirely possible to design the connector such that we maintain a 1:1 ratio of signals to traditional IDT interrupts. It is also possible to design a connector which surfaces as something else, such as PCI devices (by terminating the connector in QEMU and utilizing its PCI emulation facilities), which would naturally employ 1:1 mapping. 
So if 1:1 mapping is a critical feature (I would argue to the contrary), vbus can support it. > It also pulls parts of the device model into the host kernel. That is the point. Most of it needs to be there for performance. And what doesn't need to be there for performance can either be: a) skipped at the discretion of the connector/device-model designer OR b) included because its trivially small subset of the model (e.g. a mac
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/23/2009 05:26 PM, Gregory Haskins wrote: > > >>> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and >>> virtio-s390. It isn't especially easy. I can steal lots of code from the >>> lguest bus model, but sometimes it is good to generalize, especially >>> after the fourth implemention or so. I think this is what GHaskins tried >>> to do. >>> >>> >> Yes. vbus is more finely layered so there is less code duplication. >> > To clarify, Ira was correct in stating this generalizing some of these > components was one of the goals for the vbus project: IOW vbus finely > layers and defines what's below virtio, not replaces it. > > You can think of a virtio-stack like this: > > -- > | virtio-net > -- > | virtio-ring > -- > | virtio-bus > -- > | ? undefined ? > -- > > IOW: The way I see it, virtio is a device interface model only. The > rest of it is filled in by the virtio-transport and some kind of back-end. > > So today, we can complete the "? undefined ?" block like this for KVM: > > -- > | virtio-pci > -- > | > -- > | kvm.ko > -- > | qemu > -- > | tuntap > -- > > In this case, kvm.ko and tuntap are providing plumbing, and qemu is > providing a backend device model (pci-based, etc). > > You can, of course, plug a different stack in (such as virtio-lguest, > virtio-ira, etc) but you are more or less on your own to recreate many > of the various facilities contained in that stack (such as things > provided by QEMU, like discovery/hotswap/addressing), as Ira is discovering. > > Vbus tries to commoditize more components in the stack (like the bus > model and backend-device model) so they don't need to be redesigned each > time we solve this "virtio-transport" problem. IOW: stop the > proliferation of the need for pci-bus, lguest-bus, foo-bus underneath > virtio. Instead, we can then focus on the value add on top, like the > models themselves or the simple glue between them. 
> > So now you might have something like > > -- > | virtio-vbus > -- > | vbus-proxy > -- > | kvm-guest-connector > -- > | > -- > | kvm.ko > -- > | kvm-host-connector.ko > -- > | vbus.ko > -- > | virtio-net-backend.ko > -- > > so now we don't need to worry about the bus-model or the device-model > framework. We only need to implement the connector, etc. This is handy > when you find yourself in an environment that doesn't support PCI (such > as Ira's rig, or userspace containers), or when you want to add features > that PCI doesn't have (such as fluid event channels for things like IPC > services, or priortizable interrupts, etc). > Well, vbus does more, for example it tunnels interrupts instead of exposing them 1:1 on the native interface if it exists. It also pulls parts of the device model into the host kernel. >> The virtio layering was more or less dictated by Xen which doesn't have >> shared memory (it uses grant references instead). As a matter of fact >> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that >> part is duplicated. It's probably possible to add a virtio-shmem.ko >> library that people who do have shared memory can reuse. >> > Note that I do not believe the Xen folk use virtio, so while I can > appreciate the foresight that went into that particular aspect of the > design of the virtio model, I am not sure if its a realistic constraint. > Since a virtio goal was to reduce virtual device driver proliferation, it was necessary to accommodate Xen. -- error compiling committee.c: too many arguments to function ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/22/2009 12:43 AM, Ira W. Snyder wrote: >> >>> Sure, virtio-ira and he is on his own to make a bus-model under that, or >>> virtio-vbus + vbus-ira-connector to use the vbus framework. Either >>> model can work, I agree. >>> >>> >> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and >> virtio-s390. It isn't especially easy. I can steal lots of code from the >> lguest bus model, but sometimes it is good to generalize, especially >> after the fourth implemention or so. I think this is what GHaskins tried >> to do. >> > > Yes. vbus is more finely layered so there is less code duplication. To clarify, Ira was correct in stating that generalizing some of these components was one of the goals for the vbus project: IOW vbus finely layers and defines what's below virtio, not replaces it. You can think of a virtio-stack like this:

--
| virtio-net
--
| virtio-ring
--
| virtio-bus
--
| ? undefined ?
--

IOW: The way I see it, virtio is a device interface model only. The rest of it is filled in by the virtio-transport and some kind of back-end. So today, we can complete the "? undefined ?" block like this for KVM:

--
| virtio-pci
--
|
--
| kvm.ko
--
| qemu
--
| tuntap
--

In this case, kvm.ko and tuntap are providing plumbing, and qemu is providing a backend device model (pci-based, etc). You can, of course, plug a different stack in (such as virtio-lguest, virtio-ira, etc) but you are more or less on your own to recreate many of the various facilities contained in that stack (such as things provided by QEMU, like discovery/hotswap/addressing), as Ira is discovering. Vbus tries to commoditize more components in the stack (like the bus model and backend-device model) so they don't need to be redesigned each time we solve this "virtio-transport" problem. IOW: stop the proliferation of the need for pci-bus, lguest-bus, foo-bus underneath virtio. 
Instead, we can then focus on the value add on top, like the models themselves or the simple glue between them. So now you might have something like

--
| virtio-vbus
--
| vbus-proxy
--
| kvm-guest-connector
--
|
--
| kvm.ko
--
| kvm-host-connector.ko
--
| vbus.ko
--
| virtio-net-backend.ko
--

so now we don't need to worry about the bus-model or the device-model framework. We only need to implement the connector, etc. This is handy when you find yourself in an environment that doesn't support PCI (such as Ira's rig, or userspace containers), or when you want to add features that PCI doesn't have (such as fluid event channels for things like IPC services, or prioritizable interrupts, etc). > > The virtio layering was more or less dictated by Xen which doesn't have > shared memory (it uses grant references instead). As a matter of fact > lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that > part is duplicated. It's probably possible to add a virtio-shmem.ko > library that people who do have shared memory can reuse. Note that I do not believe the Xen folk use virtio, so while I can appreciate the foresight that went into that particular aspect of the design of the virtio model, I am not sure if it's a realistic constraint. The reason why I decided not to worry about that particular model is twofold: 1) The cost of supporting non-shared-memory designs is prohibitively high for my performance goals (for instance, requiring an exit on each ->add_buf() in addition to the ->kick()). 2) The Xen guys are unlikely to diverge from something like xenbus/xennet anyway, so it would be for naught. Therefore, I just went with a device model optimized for shared-memory outright. That said, I believe we can refactor what is called the "vbus-proxy-device" into this virtio-shmem interface that you and Anthony have described. We could make the feature optional and only support it on architectures where this makes sense. 
Kind Regards, -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/22/2009 06:25 PM, Ira W. Snyder wrote: > >> Yes. vbus is more finely layered so there is less code duplication. >> >> The virtio layering was more or less dictated by Xen which doesn't have >> shared memory (it uses grant references instead). As a matter of fact >> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that >> part is duplicated. It's probably possible to add a virtio-shmem.ko >> library that people who do have shared memory can reuse. >> >> > Seems like a nice benefit of vbus. > Yes, it is. With some work virtio can gain that too (virtio-shmem.ko). >>> I've given it some thought, and I think that running vhost-net (or >>> similar) on the ppc boards, with virtio-net on the x86 crate server will >>> work. The virtio-ring abstraction is almost good enough to work for this >>> situation, but I had to re-invent it to work with my boards. >>> >>> I've exposed a 16K region of memory as PCI BAR1 from my ppc board. >>> Remember that this is the "host" system. I used each 4K block as a >>> "device descriptor" which contains: >>> >>> 1) the type of device, config space, etc. for virtio >>> 2) the "desc" table (virtio memory descriptors, see virtio-ring) >>> 3) the "avail" table (available entries in the desc table) >>> >>> >> Won't access from x86 be slow to this memory (on the other hand, if you >> change it to main memory access from ppc will be slow... really depends >> on how your system is tuned. >> >> > Writes across the bus are fast, reads across the bus are slow. These are > just the descriptor tables for memory buffers, not the physical memory > buffers themselves. > > These only need to be written by the guest (x86), and read by the host > (ppc). The host never changes the tables, so we can cache a copy in the > guest, for a fast detach_buf() implementation (see virtio-ring, which > I'm copying the design from). > > The only accesses are writes across the PCI bus. 
There is never a need > to do a read (except for slow-path configuration). > Okay, sounds like what you're doing is optimal then. > In the spirit of "post early and often", I'm making my code available, > that's all. I'm asking anyone interested for some review, before I have > to re-code this for about the fifth time now. I'm trying to avoid > Haskins' situation, where he's invented and debugged a lot of new code, > and then been told to do it completely differently. > > Yes, the code I posted is only compile-tested, because quite a lot of > code (kernel and userspace) must be working before anything works at > all. I hate to design the whole thing, then be told that something > fundamental about it is wrong, and have to completely re-write it. > Understood. Best to get a review from Rusty then. -- error compiling committee.c: too many arguments to function
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tue, Sep 22, 2009 at 12:43:36PM +0300, Avi Kivity wrote: > On 09/22/2009 12:43 AM, Ira W. Snyder wrote: > > > >> Sure, virtio-ira and he is on his own to make a bus-model under that, or > >> virtio-vbus + vbus-ira-connector to use the vbus framework. Either > >> model can work, I agree. > >> > >> > > Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and > > virtio-s390. It isn't especially easy. I can steal lots of code from the > > lguest bus model, but sometimes it is good to generalize, especially > > after the fourth implemention or so. I think this is what GHaskins tried > > to do. > > > > Yes. vbus is more finely layered so there is less code duplication. > > The virtio layering was more or less dictated by Xen which doesn't have > shared memory (it uses grant references instead). As a matter of fact > lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that > part is duplicated. It's probably possible to add a virtio-shmem.ko > library that people who do have shared memory can reuse. > Seems like a nice benefit of vbus. > > I've given it some thought, and I think that running vhost-net (or > > similar) on the ppc boards, with virtio-net on the x86 crate server will > > work. The virtio-ring abstraction is almost good enough to work for this > > situation, but I had to re-invent it to work with my boards. > > > > I've exposed a 16K region of memory as PCI BAR1 from my ppc board. > > Remember that this is the "host" system. I used each 4K block as a > > "device descriptor" which contains: > > > > 1) the type of device, config space, etc. for virtio > > 2) the "desc" table (virtio memory descriptors, see virtio-ring) > > 3) the "avail" table (available entries in the desc table) > > > > Won't access from x86 be slow to this memory (on the other hand, if you > change it to main memory access from ppc will be slow... really depends > on how your system is tuned. > Writes across the bus are fast, reads across the bus are slow. 
These are just the descriptor tables for memory buffers, not the physical memory buffers themselves. These only need to be written by the guest (x86), and read by the host (ppc). The host never changes the tables, so we can cache a copy in the guest, for a fast detach_buf() implementation (see virtio-ring, which I'm copying the design from). The only accesses are writes across the PCI bus. There is never a need to do a read (except for slow-path configuration). > > Parts 2 and 3 are repeated three times, to allow for a maximum of three > > virtqueues per device. This is good enough for all current drivers. > > > > The plan is to switch to multiqueue soon. Will not affect you if your > boards are uniprocessor or small smp. > Everything I have is UP. I don't need extreme performance, either. 40MB/sec is the minimum I need to reach, though I'd like to have some headroom. For reference, using the CPU to handle data transfers, I get ~2MB/sec transfers. Using the DMA engine, I've hit about 60MB/sec with my "crossed-wires" virtio-net. > > I've gotten plenty of email about this from lots of interested > > developers. There are people who would like this kind of system to just > > work, while having to write just some glue for their device, just like a > > network driver. I hunch most people have created some proprietary mess > > that basically works, and left it at that. > > > > So long as you keep the system-dependent features hookable or > configurable, it should work. > > > So, here is a desperate cry for help. I'd like to make this work, and > > I'd really like to see it in mainline. I'm trying to give back to the > > community from which I've taken plenty. > > > > Not sure who you're crying for help to. Once you get this working, post > patches. If the patches are reasonably clean and don't impact > performance for the main use case, and if you can show the need, I > expect they'll be merged. 
> In the spirit of "post early and often", I'm making my code available, that's all. I'm asking anyone interested for some review, before I have to re-code this for about the fifth time now. I'm trying to avoid Haskins' situation, where he's invented and debugged a lot of new code, and then been told to do it completely differently. Yes, the code I posted is only compile-tested, because quite a lot of code (kernel and userspace) must be working before anything works at all. I hate to design the whole thing, then be told that something fundamental about it is wrong, and have to completely re-write it. Thanks for the comments, Ira > -- > error compiling committee.c: too many arguments to function
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/22/2009 12:43 AM, Ira W. Snyder wrote: > >> Sure, virtio-ira and he is on his own to make a bus-model under that, or >> virtio-vbus + vbus-ira-connector to use the vbus framework. Either >> model can work, I agree. >> >> > Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and > virtio-s390. It isn't especially easy. I can steal lots of code from the > lguest bus model, but sometimes it is good to generalize, especially > after the fourth implemention or so. I think this is what GHaskins tried > to do. > Yes. vbus is more finely layered so there is less code duplication. The virtio layering was more or less dictated by Xen which doesn't have shared memory (it uses grant references instead). As a matter of fact lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that part is duplicated. It's probably possible to add a virtio-shmem.ko library that people who do have shared memory can reuse. > I've given it some thought, and I think that running vhost-net (or > similar) on the ppc boards, with virtio-net on the x86 crate server will > work. The virtio-ring abstraction is almost good enough to work for this > situation, but I had to re-invent it to work with my boards. > > I've exposed a 16K region of memory as PCI BAR1 from my ppc board. > Remember that this is the "host" system. I used each 4K block as a > "device descriptor" which contains: > > 1) the type of device, config space, etc. for virtio > 2) the "desc" table (virtio memory descriptors, see virtio-ring) > 3) the "avail" table (available entries in the desc table) > Won't access from x86 be slow to this memory (on the other hand, if you change it to main memory access from ppc will be slow... really depends on how your system is tuned. > Parts 2 and 3 are repeated three times, to allow for a maximum of three > virtqueues per device. This is good enough for all current drivers. > The plan is to switch to multiqueue soon. 
Will not affect you if your boards are uniprocessor or small smp. > I've gotten plenty of email about this from lots of interested > developers. There are people who would like this kind of system to just > work, while having to write just some glue for their device, just like a > network driver. I hunch most people have created some proprietary mess > that basically works, and left it at that. > So long as you keep the system-dependent features hookable or configurable, it should work. > So, here is a desperate cry for help. I'd like to make this work, and > I'd really like to see it in mainline. I'm trying to give back to the > community from which I've taken plenty. > Not sure who you're crying for help to. Once you get this working, post patches. If the patches are reasonably clean and don't impact performance for the main use case, and if you can show the need, I expect they'll be merged. -- error compiling committee.c: too many arguments to function ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Wed, Sep 16, 2009 at 11:11:57PM -0400, Gregory Haskins wrote: > Avi Kivity wrote: > > On 09/16/2009 10:22 PM, Gregory Haskins wrote: > >> Avi Kivity wrote: > >> > >>> On 09/16/2009 05:10 PM, Gregory Haskins wrote: > >>> > > If kvm can do it, others can. > > > > > The problem is that you seem to either hand-wave over details like > this, > or you give details that are pretty much exactly what vbus does > already. > My point is that I've already sat down and thought about these > issues > and solved them in a freely available GPL'ed software package. > > > >>> In the kernel. IMO that's the wrong place for it. > >>> > >> 3) "in-kernel": You can do something like virtio-net to vhost to > >> potentially meet some of the requirements, but not all. > >> > >> In order to fully meet (3), you would need to do some of that stuff you > >> mentioned in the last reply with muxing device-nr/reg-nr. In addition, > >> we need to have a facility for mapping eventfds and establishing a > >> signaling mechanism (like PIO+qid), etc. KVM does this with > >> IRQFD/IOEVENTFD, but we dont have KVM in this case so it needs to be > >> invented. > >> > > > > irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted. > > Not per se, but it needs to be interfaced. How do I register that > eventfd with the fastpath in Ira's rig? How do I signal the eventfd > (x86->ppc, and ppc->x86)? > Sorry to reply so late to this thread, I've been on vacation for the past week. If you'd like to continue in another thread, please start it and CC me. On the PPC, I've got a hardware "doorbell" register which generates 30 distinguishable interrupts over the PCI bus. I have outbound and inbound registers, which can be used to signal the "other side". I assume it isn't too much code to signal an eventfd in an interrupt handler. I haven't gotten to this point in the code yet. > To take it to the next level, how do I organize that mechanism so that > it works for more than one IO-stream (e.g. 
address the various queues > within ethernet or a different device like the console)? KVM has > IOEVENTFD and IRQFD managed with MSI and PIO. This new rig does not > have the luxury of an established IO paradigm. > > Is vbus the only way to implement a solution? No. But it is _a_ way, > and its one that was specifically designed to solve this very problem > (as well as others). > > (As an aside, note that you generally will want an abstraction on top of > irqfd/eventfd like shm-signal or virtqueues to do shared-memory based > event mitigation, but I digress. That is a separate topic). > > > > >> To meet performance, this stuff has to be in kernel and there has to be > >> a way to manage it. > > > > and management belongs in userspace. > > vbus does not dictate where the management must be. Its an extensible > framework, governed by what you plug into it (ala connectors and devices). > > For instance, the vbus-kvm connector in alacrityvm chooses to put DEVADD > and DEVDROP hotswap events into the interrupt stream, because they are > simple and we already needed the interrupt stream anyway for fast-path. > > As another example: venet chose to put ->call(MACQUERY) "config-space" > into its call namespace because its simple, and we already need > ->calls() for fastpath. It therefore exports an attribute to sysfs that > allows the management app to set it. > > I could likewise have designed the connector or device-model differently > as to keep the mac-address and hotswap-events somewhere else (QEMU/PCI > userspace) but this seems silly to me when they are so trivial, so I didn't. > > > > >> Since vbus was designed to do exactly that, this is > >> what I would advocate. You could also reinvent these concepts and put > >> your own mux and mapping code in place, in addition to all the other > >> stuff that vbus does. But I am not clear why anyone would want to. > >> > > > > Maybe they like their backward compatibility and Windows support. 
> > This is really not relevant to this thread, since we are talking about > Ira's hardware. But if you must bring this up, then I will reiterate > that you just design the connector to interface with QEMU+PCI and you > have that too if that was important to you. > > But on that topic: Since you could consider KVM a "motherboard > manufacturer" of sorts (it just happens to be virtual hardware), I don't > know why KVM seems to consider itself the only motherboard manufacturer > in the world that has to make everything look legacy. If a company like > ASUS wants to add some cutting edge IO controller/bus, they simply do > it. Pretty much every product release may contain a different array of > devices, many of which are not backwards compatible with any prior > silicon. The guy/gal installing Windows on that system may see a "?" in > device-manager until they load a driver that supports the
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Wed, Sep 16, 2009 at 10:11 PM, Gregory Haskins wrote: > It is certainly not a requirement to make said > chip somehow work with existing drivers/facilities on bare metal, per > se. Why should virtual systems be different? i'd guess it's an issue of support resources. a hardware developer creates a chip and immediately sells it, getting small but assured revenue, with it they write (or pays to write) drivers for a couple of releases, and stop to manufacture it as soon as it's not profitable. software has a much longer lifetime, especially at the platform-level (and KVM is a platform for a lot of us). also, being GPL, it's cheaper to produce but has (much!) more limited resources. creating a new support issue is a scary thought. -- Javier
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/17/2009 06:11 AM, Gregory Haskins wrote: > >> irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted. >> > Not per se, but it needs to be interfaced. How do I register that > eventfd with the fastpath in Ira's rig? How do I signal the eventfd > (x86->ppc, and ppc->x86)? > You write a userspace or kernel module to do it. It's a few dozen lines of code. > To take it to the next level, how do I organize that mechanism so that > it works for more than one IO-stream (e.g. address the various queues > within ethernet or a different device like the console)? KVM has > IOEVENTFD and IRQFD managed with MSI and PIO. This new rig does not > have the luxury of an established IO paradigm. > > Is vbus the only way to implement a solution? No. But it is _a_ way, > and its one that was specifically designed to solve this very problem > (as well as others). > virtio assumes that the number of transports will be limited and interesting growth is in the number of device classes and drivers. So we have support for just three transports, but 6 device classes (9p, rng, balloon, console, blk, net) and 8 drivers (the preceding 6 for linux, plus blk/net for Windows). It would have been nice to be able to write a new binding in Visual Basic but it's hardly a killer feature. >>> Since vbus was designed to do exactly that, this is >>> what I would advocate. You could also reinvent these concepts and put >>> your own mux and mapping code in place, in addition to all the other >>> stuff that vbus does. But I am not clear why anyone would want to. >>> >>> >> Maybe they like their backward compatibility and Windows support. >> > This is really not relevant to this thread, since we are talking about > Ira's hardware. But if you must bring this up, then I will reiterate > that you just design the connector to interface with QEMU+PCI and you > have that too if that was important to you. > Well, for Ira the major issue is probably inclusion in the upstream kernel. 
> But on that topic: Since you could consider KVM a "motherboard > manufacturer" of sorts (it just happens to be virtual hardware), I don't > know why KVM seems to consider itself the only motherboard manufacturer > in the world that has to make everything look legacy. If a company like > ASUS wants to add some cutting edge IO controller/bus, they simply do > it. No, they don't. New buses are added through industry consortiums these days. No one adds a bus that is only available with their machine, not even Apple. > Pretty much every product release may contain a different array of > devices, many of which are not backwards compatible with any prior > silicon. The guy/gal installing Windows on that system may see a "?" in > device-manager until they load a driver that supports the new chip, and > subsequently it works. It is certainly not a requirement to make said > chip somehow work with existing drivers/facilities on bare metal, per > se. Why should virtual systems be different? > Devices/drivers are a different matter, and if you have a virtio-net device you'll get the same "?" until you load the driver. That's how people and the OS vendors expect things to work. > What I was getting at is that you can't just hand-wave the datapath > stuff. We do fast path in KVM with IRQFD/IOEVENTFD+PIO, and we do > device discovery/addressing with PCI. That's not datapath stuff. > Neither of those are available > here in Ira's case yet the general concepts are needed. Therefore, we > have to come up with something else. > Ira has to implement virtio's ->kick() function and come up with something for discovery. It's a lot less lines of code than there are messages in this thread. >> Yes. I'm all for reusing virtio, but I'm not going to switch to vbus or >> support both for this esoteric use case. >> > With all due respect, no one asked you to. This sub-thread was > originally about using vhost in Ira's rig. 
When problems surfaced in > that proposed model, I highlighted that I had already addressed that > problem in vbus, and here we are. > Ah, okay. I have no interest in Ira choosing either virtio or vbus. >> vhost-net somehow manages to work without the config stuff in the kernel. >> > I was referring to data-path stuff, like signal and memory > configuration/routing. > signal and memory configuration/routing are not data-path stuff. >> Well, virtio has a similar abstraction on the guest side. The host side >> abstraction is limited to signalling since all configuration is in >> userspace. vhost-net ought to work for lguest and s390 without change. >> > But IIUC that is primarily because the revectoring work is already in > QEMU for virtio-u and it rides on that, right? Not knocking that, thats > nice and a distinct advantage. It should just be noted that its based > on sunk-cost, and not truly free. Its just already paid for, which is > different.
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote: > On Wed, Sep 16, 2009 at 10:10:55AM -0400, Gregory Haskins wrote: >>> There is no role reversal. >> So if I have virtio-blk driver running on the x86 and vhost-blk device >> running on the ppc board, I can use the ppc board as a block-device. >> What if I really wanted to go the other way? > > It seems ppc is the only one that can initiate DMA to an arbitrary > address, so you can't do this really, or you can by tunneling each > request back to ppc, or doing an extra data copy, but it's unlikely to > work well. > > The limitation comes from hardware, not from the API we use. Understood, but presumably it can be exposed as a sub-function of the ppc board's register file as a DMA-controller service to the x86. This would fall into the "tunnel requests back" category you mention above, though I think "tunnel" implies a heavier protocol than it would actually require. This would look more like a PIO cycle to a DMA controller than some higher layer protocol. You would then utilize that DMA service inside the memctx, and the rest of vbus would work transparently with the existing devices/drivers. I do agree it would require some benchmarking to determine its feasibility, which is why I was careful to say things like "may work" ;). I also do not even know if its possible to expose the service this way on his system. If this design is not possible or performs poorly, I admit vbus is just as hosed as vhost in regard to the "role correction" benefit. Kind Regards, -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Wed, Sep 16, 2009 at 10:10:55AM -0400, Gregory Haskins wrote: > > There is no role reversal. > > So if I have virtio-blk driver running on the x86 and vhost-blk device > running on the ppc board, I can use the ppc board as a block-device. > What if I really wanted to go the other way? It seems ppc is the only one that can initiate DMA to an arbitrary address, so you can't do this really, or you can by tunneling each request back to ppc, or doing an extra data copy, but it's unlikely to work well. The limitation comes from hardware, not from the API we use.
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/16/2009 10:22 PM, Gregory Haskins wrote: >> Avi Kivity wrote: >> >>> On 09/16/2009 05:10 PM, Gregory Haskins wrote: >>> > If kvm can do it, others can. > > The problem is that you seem to either hand-wave over details like this, or you give details that are pretty much exactly what vbus does already. My point is that I've already sat down and thought about these issues and solved them in a freely available GPL'ed software package. >>> In the kernel. IMO that's the wrong place for it. >>> >> 3) "in-kernel": You can do something like virtio-net to vhost to >> potentially meet some of the requirements, but not all. >> >> In order to fully meet (3), you would need to do some of that stuff you >> mentioned in the last reply with muxing device-nr/reg-nr. In addition, >> we need to have a facility for mapping eventfds and establishing a >> signaling mechanism (like PIO+qid), etc. KVM does this with >> IRQFD/IOEVENTFD, but we dont have KVM in this case so it needs to be >> invented. >> > > irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted. Not per se, but it needs to be interfaced. How do I register that eventfd with the fastpath in Ira's rig? How do I signal the eventfd (x86->ppc, and ppc->x86)? To take it to the next level, how do I organize that mechanism so that it works for more than one IO-stream (e.g. address the various queues within ethernet or a different device like the console)? KVM has IOEVENTFD and IRQFD managed with MSI and PIO. This new rig does not have the luxury of an established IO paradigm. Is vbus the only way to implement a solution? No. But it is _a_ way, and its one that was specifically designed to solve this very problem (as well as others). (As an aside, note that you generally will want an abstraction on top of irqfd/eventfd like shm-signal or virtqueues to do shared-memory based event mitigation, but I digress. That is a separate topic). 
> >> To meet performance, this stuff has to be in kernel and there has to be >> a way to manage it. > > and management belongs in userspace. vbus does not dictate where the management must be. Its an extensible framework, governed by what you plug into it (ala connectors and devices). For instance, the vbus-kvm connector in alacrityvm chooses to put DEVADD and DEVDROP hotswap events into the interrupt stream, because they are simple and we already needed the interrupt stream anyway for fast-path. As another example: venet chose to put ->call(MACQUERY) "config-space" into its call namespace because its simple, and we already need ->calls() for fastpath. It therefore exports an attribute to sysfs that allows the management app to set it. I could likewise have designed the connector or device-model differently as to keep the mac-address and hotswap-events somewhere else (QEMU/PCI userspace) but this seems silly to me when they are so trivial, so I didn't. > >> Since vbus was designed to do exactly that, this is >> what I would advocate. You could also reinvent these concepts and put >> your own mux and mapping code in place, in addition to all the other >> stuff that vbus does. But I am not clear why anyone would want to. >> > > Maybe they like their backward compatibility and Windows support. This is really not relevant to this thread, since we are talking about Ira's hardware. But if you must bring this up, then I will reiterate that you just design the connector to interface with QEMU+PCI and you have that too if that was important to you. But on that topic: Since you could consider KVM a "motherboard manufacturer" of sorts (it just happens to be virtual hardware), I don't know why KVM seems to consider itself the only motherboard manufacturer in the world that has to make everything look legacy. If a company like ASUS wants to add some cutting edge IO controller/bus, they simply do it. 
Pretty much every product release may contain a different array of devices, many of which are not backwards compatible with any prior silicon. The guy/gal installing Windows on that system may see a "?" in device-manager until they load a driver that supports the new chip, and subsequently it works. It is certainly not a requirement to make said chip somehow work with existing drivers/facilities on bare metal, per se. Why should virtual systems be different? So, yeah, the current design of the vbus-kvm connector means I have to provide a driver. This is understood, and I have no problem with that. The only thing that I would agree has to be backwards compatible is the BIOS/boot function. If you can't support running an image like the Windows installer, you are hosed. If you can't use your ethernet until you get a chance to install a driver after the install completes, its just like most other systems in existence. IOW: It's not a big deal. For cases where the IO system is needed
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/16/2009 10:22 PM, Gregory Haskins wrote: > Avi Kivity wrote: > >> On 09/16/2009 05:10 PM, Gregory Haskins wrote: >> If kvm can do it, others can. >>> The problem is that you seem to either hand-wave over details like this, >>> or you give details that are pretty much exactly what vbus does already. >>> My point is that I've already sat down and thought about these issues >>> and solved them in a freely available GPL'ed software package. >>> >>> >> In the kernel. IMO that's the wrong place for it. >> > 3) "in-kernel": You can do something like virtio-net to vhost to > potentially meet some of the requirements, but not all. > > In order to fully meet (3), you would need to do some of that stuff you > mentioned in the last reply with muxing device-nr/reg-nr. In addition, > we need to have a facility for mapping eventfds and establishing a > signaling mechanism (like PIO+qid), etc. KVM does this with > IRQFD/IOEVENTFD, but we dont have KVM in this case so it needs to be > invented. > irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted. > To meet performance, this stuff has to be in kernel and there has to be > a way to manage it. and management belongs in userspace. > Since vbus was designed to do exactly that, this is > what I would advocate. You could also reinvent these concepts and put > your own mux and mapping code in place, in addition to all the other > stuff that vbus does. But I am not clear why anyone would want to. > Maybe they like their backward compatibility and Windows support. > So no, the kernel is not the wrong place for it. Its the _only_ place > for it. Otherwise, just use (1) and be done with it. > > I'm talking about the config stuff, not the data path. >> Further, if we adopt >> vbus, we drop compatibility with existing guests or have to support both >> vbus and virtio-pci. >> > We already need to support both (at least to support Ira). virtio-pci > doesn't work here. Something else (vbus, or vbus-like) is needed. 
> virtio-ira. >>> So the question is: is your position that vbus is all wrong and you wish >>> to create a new bus-like thing to solve the problem? >>> >> I don't intend to create anything new, I am satisfied with virtio. If >> it works for Ira, excellent. If not, too bad. >> > I think that about sums it up, then. > Yes. I'm all for reusing virtio, but I'm not going to switch to vbus or support both for this esoteric use case. >>> If so, how is it >>> different from what Ive already done? More importantly, what specific >>> objections do you have to what Ive done, as perhaps they can be fixed >>> instead of starting over? >>> >>> >> The two biggest objections are: >> - the host side is in the kernel >> > As it needs to be. > vhost-net somehow manages to work without the config stuff in the kernel. > With all due respect, based on all of your comments in aggregate I > really do not think you are truly grasping what I am actually building here. > Thanks. >>> Bingo. So now its a question of do you want to write this layer from >>> scratch, or re-use my framework. >>> >>> >> You will have to implement a connector or whatever for vbus as well. >> vbus has more layers so it's probably smaller for vbus. >> > Bingo! (addictive, isn't it) > That is precisely the point. > > All the stuff for how to map eventfds, handle signal mitigation, demux > device/function pointers, isolation, etc, are built in. All the > connector has to do is transport the 4-6 verbs and provide a memory > mapping/copy function, and the rest is reusable. The device models > would then work in all environments unmodified, and likewise the > connectors could use all device-models unmodified. > Well, virtio has a similar abstraction on the guest side. The host side abstraction is limited to signalling since all configuration is in userspace. vhost-net ought to work for lguest and s390 without change. >> It was already implemented three times for virtio, so apparently that's >> extensible too. 
>> > And to my point, I'm trying to commoditize as much of that process as > possible on both the front and backends (at least for cases where > performance matters) so that you don't need to reinvent the wheel for > each one. > Since you're interested in any-to-any connectors it makes sense to you. I'm only interested in kvm-host-to-kvm-guest, so reducing the already minor effort to implement a new virtio binding has little appeal to me. >> You mean, if the x86 board was able to access the disks and dma into the >> ppc board's memory? You'd run vhost-blk on x86 and virtio-blk on ppc. >> > But as we discussed, vhost doesn't work well if you try to run it on the > x86 side due to its assumptions about pageable "guest" memory, right? So > is that even an option? And even still, you would still need to solve > t
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/16/2009 05:10 PM, Gregory Haskins wrote: >> >>> If kvm can do it, others can. >>> >> The problem is that you seem to either hand-wave over details like this, >> or you give details that are pretty much exactly what vbus does already. >> My point is that I've already sat down and thought about these issues >> and solved them in a freely available GPL'ed software package. >> > > In the kernel. IMO that's the wrong place for it. In conversations with Ira, he indicated he needs kernel-to-kernel ethernet for performance, and needs at least ethernet and console connectivity. You could conceivably build a solution for this system 3 basic ways: 1) "completely" in userspace: use things like tuntap on the ppc boards, and tunnel packets across a custom point-to-point connection formed over the pci link to a userspace app on the x86 board. This app then reinjects the packets into the x86 kernel as a raw socket or tuntap, etc. Pretty much vanilla tuntap/vpn kind of stuff. Advantage: very little kernel code. Problem: performance (citation: hopefully obvious). 2) "partially" in userspace: have an in-kernel virtio-net driver talk to a userspace based virtio-net backend. This is the (current, non-vhost oriented) KVM/qemu model. Advantage: re-uses existing kernel code. Problem: performance (citation: see alacrityvm numbers). 3) "in-kernel": You can do something like virtio-net to vhost to potentially meet some of the requirements, but not all. In order to fully meet (3), you would need to do some of that stuff you mentioned in the last reply with muxing device-nr/reg-nr. In addition, we need to have a facility for mapping eventfds and establishing a signaling mechanism (like PIO+qid), etc. KVM does this with IRQFD/IOEVENTFD, but we dont have KVM in this case so it needs to be invented. To meet performance, this stuff has to be in kernel and there has to be a way to manage it. Since vbus was designed to do exactly that, this is what I would advocate. 
You could also reinvent these concepts and put your own mux and mapping code in place, in addition to all the other stuff that vbus does. But I am not clear why anyone would want to. So no, the kernel is not the wrong place for it. Its the _only_ place for it. Otherwise, just use (1) and be done with it. > Further, if we adopt > vbus, if drop compatibility with existing guests or have to support both > vbus and virtio-pci. We already need to support both (at least to support Ira). virtio-pci doesn't work here. Something else (vbus, or vbus-like) is needed. > >> So the question is: is your position that vbus is all wrong and you wish >> to create a new bus-like thing to solve the problem? > > I don't intend to create anything new, I am satisfied with virtio. If > it works for Ira, excellent. If not, too bad. I think that about sums it up, then. > I believe it will work without too much trouble. Afaict it wont for the reasons I mentioned. > >> If so, how is it >> different from what Ive already done? More importantly, what specific >> objections do you have to what Ive done, as perhaps they can be fixed >> instead of starting over? >> > > The two biggest objections are: > - the host side is in the kernel As it needs to be. > - the guest side is a new bus instead of reusing pci (on x86/kvm), > making Windows support more difficult Thats a function of the vbus-connector, which is different from vbus-core. If you don't like it (and I know you don't), we can write one that interfaces to qemu's pci system. I just don't like the limitations that imposes, nor do I think we need that complexity of dealing with a split PCI model, so I chose to not implement vbus-kvm this way. With all due respect, based on all of your comments in aggregate I really do not think you are truly grasping what I am actually building here. > > I guess these two are exactly what you think are vbus' greatest > advantages, so we'll probably have to extend our agree-to-disagree on > this one. 
> > I also had issues with using just one interrupt vector to service all > events, but that's easily fixed. Again, function of the connector. > >>> There is no guest and host in this scenario. There's a device side >>> (ppc) and a driver side (x86). The driver side can access configuration >>> information on the device side. How to multiplex multiple devices is an >>> interesting exercise for whoever writes the virtio binding for that >>> setup. >>> >> Bingo. So now its a question of do you want to write this layer from >> scratch, or re-use my framework. >> > > You will have to implement a connector or whatever for vbus as well. > vbus has more layers so it's probably smaller for vbus. Bingo! That is precisely the point. All the stuff for how to map eventfds, handle signal mitigation, demux device/function pointers, isolation, etc, are built in. All the connector has to do is transport the 4-6 verbs and provide
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Wed, Sep 16, 2009 at 05:22:37PM +0200, Arnd Bergmann wrote: > On Wednesday 16 September 2009, Michael S. Tsirkin wrote: > > On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote: > > > On Tuesday 15 September 2009, Michael S. Tsirkin wrote: > > > > Userspace in x86 maps a PCI region, uses it for communication with ppc? > > > > > > This might have portability issues. On x86 it should work, but if the > > > host is powerpc or similar, you cannot reliably access PCI I/O memory > > > through copy_tofrom_user but have to use memcpy_toio/fromio or > > > readl/writel > > > calls, which don't work on user pointers. > > > > > > Specifically on powerpc, copy_from_user cannot access unaligned buffers > > > if they are on an I/O mapping. > > > > > We are talking about doing this in userspace, not in kernel. > > Ok, that's fine then. I thought the idea was to use the vhost_net driver It's a separate issue. We were talking generally about configuration and setup. Gregory implemented it in kernel, Avi wants it moved to userspace, with only fastpath in kernel. > to access the user memory, which would be a really cute hack otherwise, > as you'd only need to provide the eventfds from a hardware specific > driver and could use the regular virtio_net on the other side. > > Arnd <>< To do that, maybe copy to user on ppc can be fixed, or wrapped around in an arch-specific macro, so that everyone else does not have to go through abstraction layers. -- MST
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/16/2009 05:10 PM, Gregory Haskins wrote: > >> If kvm can do it, others can. >> > The problem is that you seem to either hand-wave over details like this, > or you give details that are pretty much exactly what vbus does already. > My point is that I've already sat down and thought about these issues > and solved them in a freely available GPL'ed software package. > In the kernel. IMO that's the wrong place for it. Further, if we adopt vbus, we drop compatibility with existing guests or have to support both vbus and virtio-pci. > So the question is: is your position that vbus is all wrong and you wish > to create a new bus-like thing to solve the problem? I don't intend to create anything new, I am satisfied with virtio. If it works for Ira, excellent. If not, too bad. I believe it will work without too much trouble. > If so, how is it > different from what Ive already done? More importantly, what specific > objections do you have to what Ive done, as perhaps they can be fixed > instead of starting over? > The two biggest objections are: - the host side is in the kernel - the guest side is a new bus instead of reusing pci (on x86/kvm), making Windows support more difficult I guess these two are exactly what you think are vbus' greatest advantages, so we'll probably have to extend our agree-to-disagree on this one. I also had issues with using just one interrupt vector to service all events, but that's easily fixed. >> There is no guest and host in this scenario. There's a device side >> (ppc) and a driver side (x86). The driver side can access configuration >> information on the device side. How to multiplex multiple devices is an >> interesting exercise for whoever writes the virtio binding for that setup. >> > Bingo. So now its a question of do you want to write this layer from > scratch, or re-use my framework. > You will have to implement a connector or whatever for vbus as well. vbus has more layers so it's probably smaller for vbus. 
>>> I am talking about how we would tunnel the config space for N devices >>> across his transport. >>> >>> >> Sounds trivial. >> > No one said it was rocket science. But it does need to be designed and > implemented end-to-end, much of which I've already done in what I hope is > an extensible way. > It was already implemented three times for virtio, so apparently that's extensible too. >> Write an address containing the device number and >> register number to one location, read or write data from another. >> > You mean like the "u64 devh", and "u32 func" fields I have here for the > vbus-kvm connector? > > http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l64 > > Probably. >>> That sounds convenient given his hardware, but it has its own set of >>> problems. For one, the configuration/inventory of these boards is now >>> driven by the wrong side and has to be addressed. >>> >> Why is it the wrong side? >> > "Wrong" is probably too harsh a word when looking at ethernet. It's > certainly "odd", and possibly inconvenient. It would be like having > vhost in a KVM guest, and virtio-net running on the host. You could do > it, but it's weird and awkward. Where it really falls apart and enters > the "wrong" category is for non-symmetric devices, like disk-io. > > It's not odd or wrong or weird or awkward. An ethernet NIC is not symmetric, one side does DMA and issues interrupts, the other uses its own memory. That's exactly the case with Ira's setup. If the ppc boards were to emulate a disk controller, you'd run virtio-blk on x86 and vhost-blk on the ppc boards. >>> Second, the role >>> reversal will likely not work for many models other than ethernet (e.g. 
>>> virtio-console or virtio-blk drivers running on the x86 board would be >>> naturally consuming services from the slave boards...virtio-net is an >>> exception because 802.x is generally symmetrical). >>> >>> >> There is no role reversal. >> > So if I have virtio-blk driver running on the x86 and vhost-blk device > running on the ppc board, I can use the ppc board as a block-device. > What if I really wanted to go the other way? > You mean, if the x86 board was able to access the disks and dma into the ppc board's memory? You'd run vhost-blk on x86 and virtio-blk on ppc. As long as you don't use the words "guest" and "host" but keep to "driver" and "device", it all works out. >> The side doing dma is the device, the side >> accessing its own memory is the driver. Just like that other 1e12 >> driver/device pairs out there. >> > IIUC, his ppc boards really can be seen as "guests" (they are linux > instances that are utilizing services from the x86, not the other way > around). They aren't guests. Guests d
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Wednesday 16 September 2009, Michael S. Tsirkin wrote: > On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote: > > On Tuesday 15 September 2009, Michael S. Tsirkin wrote: > > > Userspace in x86 maps a PCI region, uses it for communication with ppc? > > > > This might have portability issues. On x86 it should work, but if the > > host is powerpc or similar, you cannot reliably access PCI I/O memory > > through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel > > calls, which don't work on user pointers. > > > > Specifically on powerpc, copy_from_user cannot access unaligned buffers > > if they are on an I/O mapping. > > > We are talking about doing this in userspace, not in kernel. Ok, that's fine then. I thought the idea was to use the vhost_net driver to access the user memory, which would be a really cute hack otherwise, as you'd only need to provide the eventfds from a hardware specific driver and could use the regular virtio_net on the other side. Arnd <>< ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote: > On Tuesday 15 September 2009, Michael S. Tsirkin wrote: > > Userspace in x86 maps a PCI region, uses it for communication with ppc? > > This might have portability issues. On x86 it should work, but if the > host is powerpc or similar, you cannot reliably access PCI I/O memory > through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel > calls, which don't work on user pointers. > > Specifically on powerpc, copy_from_user cannot access unaligned buffers > if they are on an I/O mapping. > > Arnd <>< We are talking about doing this in userspace, not in kernel. -- MST
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tuesday 15 September 2009, Michael S. Tsirkin wrote: > Userspace in x86 maps a PCI region, uses it for communication with ppc? This might have portability issues. On x86 it should work, but if the host is powerpc or similar, you cannot reliably access PCI I/O memory through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel calls, which don't work on user pointers. Specifically on powerpc, copy_from_user cannot access unaligned buffers if they are on an I/O mapping. Arnd <><
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/16/2009 02:44 PM, Gregory Haskins wrote: >> The problem isn't where to find the models...the problem is how to >> aggregate multiple models to the guest. >> > > You mean configuration? > >>> You instantiate multiple vhost-nets. Multiple ethernet NICs is a >>> supported configuration for kvm. >>> >> But this is not KVM. >> >> > > If kvm can do it, others can. The problem is that you seem to either hand-wave over details like this, or you give details that are pretty much exactly what vbus does already. My point is that I've already sat down and thought about these issues and solved them in a freely available GPL'ed software package. So the question is: is your position that vbus is all wrong and you wish to create a new bus-like thing to solve the problem? If so, how is it different from what I've already done? More importantly, what specific objections do you have to what I've done, as perhaps they can be fixed instead of starting over? > His slave boards surface themselves as PCI devices to the x86 host. So how do you use that to make multiple vhost-based devices (say two virtio-nets, and a virtio-console) communicate across the transport? >>> I don't really see the difference between 1 and N here. >>> >> A KVM surfaces N virtio-devices as N pci-devices to the guest. What do >> we do in Ira's case where the entire guest represents itself as a PCI >> device to the host, and nothing the other way around? >> > > There is no guest and host in this scenario. There's a device side > (ppc) and a driver side (x86). The driver side can access configuration > information on the device side. How to multiplex multiple devices is an > interesting exercise for whoever writes the virtio binding for that setup. Bingo. So now it's a question of whether you want to write this layer from scratch, or re-use my framework. 
> There are multiple ways to do this, but what I am saying is that whatever is conceived will start to look eerily like a vbus-connector, since this is one of its primary purposes ;) >>> I'm not sure if you're talking about the configuration interface or data >>> path here. >>> >> I am talking about how we would tunnel the config space for N devices >> across his transport. >> > > Sounds trivial. No one said it was rocket science. But it does need to be designed and implemented end-to-end, much of which I've already done in what I hope is an extensible way. > Write an address containing the device number and > register number to one location, read or write data from another. You mean like the "u64 devh", and "u32 func" fields I have here for the vbus-kvm connector? http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l64 > Just > like the PCI cf8/cfc interface. > >>> They aren't in the "guest". The best way to look at it is >>> >>> - a device side, with a dma engine: vhost-net >>> - a driver side, only accessing its own memory: virtio-net >>> >>> Given that Ira's config has the dma engine in the ppc boards, that's >>> where vhost-net would live (the ppc boards acting as NICs to the x86 >>> board, essentially). >>> >> That sounds convenient given his hardware, but it has its own set of >> problems. For one, the configuration/inventory of these boards is now >> driven by the wrong side and has to be addressed. > > Why is it the wrong side? "Wrong" is probably too harsh a word when looking at ethernet. It's certainly "odd", and possibly inconvenient. It would be like having vhost in a KVM guest, and virtio-net running on the host. You could do it, but it's weird and awkward. Where it really falls apart and enters the "wrong" category is for non-symmetric devices, like disk-io. 
> >> Second, the role >> reversal will likely not work for many models other than ethernet (e.g. >> virtio-console or virtio-blk drivers running on the x86 board would be >> naturally consuming services from the slave boards...virtio-net is an >> exception because 802.x is generally symmetrical). >> > > There is no role reversal. So if I have virtio-blk driver running on the x86 and vhost-blk device running on the ppc board, I can use the ppc board as a block-device. What if I really wanted to go the other way? > The side doing dma is the device, the side > accessing its own memory is the driver. Just like that other 1e12 > driver/device pairs out there. IIUC, his ppc boards really can be seen as "guests" (they are linux instances that are utilizing services from the x86, not the other way around). vhost forces the model to have the ppc boards act as IO-hosts, whereas vbus would likely work in either direction due to its more refined abstraction layer. > >>> I have no idea, that's for Ira to solve. >>> >> Bingo. Thus
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/16/2009 02:44 PM, Gregory Haskins wrote: > The problem isn't where to find the models...the problem is how to > aggregate multiple models to the guest. > You mean configuration? >> You instantiate multiple vhost-nets. Multiple ethernet NICs is a >> supported configuration for kvm. >> > But this is not KVM. > > If kvm can do it, others can. >>> His slave boards surface themselves as PCI devices to the x86 >>> host. So how do you use that to make multiple vhost-based devices (say >>> two virtio-nets, and a virtio-console) communicate across the transport? >>> >>> >> I don't really see the difference between 1 and N here. >> > A KVM surfaces N virtio-devices as N pci-devices to the guest. What do > we do in Ira's case where the entire guest represents itself as a PCI > device to the host, and nothing the other way around? > There is no guest and host in this scenario. There's a device side (ppc) and a driver side (x86). The driver side can access configuration information on the device side. How to multiplex multiple devices is an interesting exercise for whoever writes the virtio binding for that setup. >>> There are multiple ways to do this, but what I am saying is that >>> whatever is conceived will start to look eerily like a vbus-connector, >>> since this is one of its primary purposes ;) >>> >>> >> I'm not sure if you're talking about the configuration interface or data >> path here. >> > I am talking about how we would tunnel the config space for N devices > across his transport. > Sounds trivial. Write an address containing the device number and register number to one location, read or write data from another. Just like the PCI cf8/cfc interface. >> They aren't in the "guest". 
The best way to look at it is >> >> - a device side, with a dma engine: vhost-net >> - a driver side, only accessing its own memory: virtio-net >> >> Given that Ira's config has the dma engine in the ppc boards, that's >> where vhost-net would live (the ppc boards acting as NICs to the x86 >> board, essentially). >> > That sounds convenient given his hardware, but it has its own set of > problems. For one, the configuration/inventory of these boards is now > driven by the wrong side and has to be addressed. Why is it the wrong side? > Second, the role > reversal will likely not work for many models other than ethernet (e.g. > virtio-console or virtio-blk drivers running on the x86 board would be > naturally consuming services from the slave boards...virtio-net is an > exception because 802.x is generally symmetrical). > There is no role reversal. The side doing dma is the device, the side accessing its own memory is the driver. Just like that other 1e12 driver/device pairs out there. >> I have no idea, that's for Ira to solve. >> > Bingo. Thus my statement that the vhost proposal is incomplete. You > have the virtio-net and vhost-net pieces covering the fast-path > end-points, but nothing in the middle (transport, aggregation, > config-space), and nothing on the management-side. vbus provides most > of the other pieces, and can even support the same virtio-net protocol > on top. The remaining part would be something like a udev script to > populate the vbus with devices on board-insert events. > Of course vhost is incomplete, in the same sense that Linux is incomplete. Both require userspace. >> If he could fake the PCI >> config space as seen by the x86 board, he would just show the normal pci >> config and use virtio-pci (multiple channels would show up as a >> multifunction device). Given he can't, he needs to tunnel the virtio >> config space some other way. >> > Right, and note that vbus was designed to solve this. 
This tunneling > can, of course, be done without vbus using some other design. However, > whatever solution is created will look incredibly close to what I've > already done, so my point is "why reinvent it"? > virtio requires binding for this tunnelling, so does vbus. It's the same problem with the same solution. -- error compiling committee.c: too many arguments to function
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/15/2009 11:08 PM, Gregory Haskins wrote: >> >>> There's virtio-console, virtio-blk etc. None of these have kernel-mode >>> servers, but these could be implemented if/when needed. >>> >> IIUC, Ira already needs at least ethernet and console capability. >> >> > > He's welcome to pick up the necessary code from qemu. The problem isn't where to find the models...the problem is how to aggregate multiple models to the guest. > b) what do you suppose this protocol to aggregate the connections would look like? (hint: this is what a vbus-connector does). >>> You mean multilink? You expose the device as a multiqueue. >>> >> No, what I mean is how do you surface multiple ethernet and consoles to >> the guests? For Ira's case, I think he needs at minimum at least one of >> each, and he mentioned possibly having two unique ethernets at one point. >> > > You instantiate multiple vhost-nets. Multiple ethernet NICs is a > supported configuration for kvm. But this is not KVM. > >> His slave boards surface themselves as PCI devices to the x86 >> host. So how do you use that to make multiple vhost-based devices (say >> two virtio-nets, and a virtio-console) communicate across the transport? >> > > I don't really see the difference between 1 and N here. A KVM surfaces N virtio-devices as N pci-devices to the guest. What do we do in Ira's case where the entire guest represents itself as a PCI device to the host, and nothing the other way around? > >> There are multiple ways to do this, but what I am saying is that >> whatever is conceived will start to look eerily like a vbus-connector, >> since this is one of its primary purposes ;) >> > > I'm not sure if you're talking about the configuration interface or data > path here. I am talking about how we would tunnel the config space for N devices across his transport. As an aside, the vbus-kvm connector makes them one and the same, but they do not have to be. It's all in the connector design. 
> c) how do you manage the configuration, especially on a per-board basis? >>> pci (for kvm/x86). >>> >> Ok, for kvm understood (and I would also add "qemu" to that mix). But >> we are talking about vhost's application in a non-kvm environment here, >> right? >> >> So if the vhost-X devices are in the "guest", > > They aren't in the "guest". The best way to look at it is > > - a device side, with a dma engine: vhost-net > - a driver side, only accessing its own memory: virtio-net > > Given that Ira's config has the dma engine in the ppc boards, that's > where vhost-net would live (the ppc boards acting as NICs to the x86 > board, essentially). That sounds convenient given his hardware, but it has its own set of problems. For one, the configuration/inventory of these boards is now driven by the wrong side and has to be addressed. Second, the role reversal will likely not work for many models other than ethernet (e.g. virtio-console or virtio-blk drivers running on the x86 board would be naturally consuming services from the slave boards...virtio-net is an exception because 802.x is generally symmetrical). IIUC, vbus would support having the device models live properly on the x86 side, solving both of these problems. It would be impossible to reverse vhost given its current design. > >> and the x86 board is just >> a slave...How do you tell each ppc board how many devices and what >> config (e.g. MACs, etc) to instantiate? Do you assume that they should >> all be symmetric and based on positional (e.g. slot) data? What if you >> want asymmetric configurations (if not here, perhaps in a different >> environment)? >> > > I have no idea, that's for Ira to solve. Bingo. Thus my statement that the vhost proposal is incomplete. You have the virtio-net and vhost-net pieces covering the fast-path end-points, but nothing in the middle (transport, aggregation, config-space), and nothing on the management-side. 
vbus provides most of the other pieces, and can even support the same virtio-net protocol on top. The remaining part would be something like a udev script to populate the vbus with devices on board-insert events. > If he could fake the PCI > config space as seen by the x86 board, he would just show the normal pci > config and use virtio-pci (multiple channels would show up as a > multifunction device). Given he can't, he needs to tunnel the virtio > config space some other way. Right, and note that vbus was designed to solve this. This tunneling can, of course, be done without vbus using some other design. However, whatever solution is created will look incredibly close to what I've already done, so my point is "why reinvent it"? > >>> Yes. virtio is really virtualization oriented. >>> >> I would say that it's vhost in particular that is virtualization >
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/15/2009 11:08 PM, Gregory Haskins wrote: > >> There's virtio-console, virtio-blk etc. None of these have kernel-mode >> servers, but these could be implemented if/when needed. >> > IIUC, Ira already needs at least ethernet and console capability. > > He's welcome to pick up the necessary code from qemu. >>> b) what do you suppose this protocol to aggregate the connections would >>> look like? (hint: this is what a vbus-connector does). >>> >>> >> You mean multilink? You expose the device as a multiqueue. >> > No, what I mean is how do you surface multiple ethernet and consoles to > the guests? For Ira's case, I think he needs at minimum at least one of > each, and he mentioned possibly having two unique ethernets at one point. > You instantiate multiple vhost-nets. Multiple ethernet NICs is a supported configuration for kvm. > His slave boards surface themselves as PCI devices to the x86 > host. So how do you use that to make multiple vhost-based devices (say > two virtio-nets, and a virtio-console) communicate across the transport? > I don't really see the difference between 1 and N here. > There are multiple ways to do this, but what I am saying is that > whatever is conceived will start to look eerily like a vbus-connector, > since this is one of its primary purposes ;) > I'm not sure if you're talking about the configuration interface or data path here. >>> c) how do you manage the configuration, especially on a per-board basis? >>> >>> >> pci (for kvm/x86). >> > Ok, for kvm understood (and I would also add "qemu" to that mix). But > we are talking about vhost's application in a non-kvm environment here, > right? > > So if the vhost-X devices are in the "guest", They aren't in the "guest". 
The best way to look at it is - a device side, with a dma engine: vhost-net - a driver side, only accessing its own memory: virtio-net Given that Ira's config has the dma engine in the ppc boards, that's where vhost-net would live (the ppc boards acting as NICs to the x86 board, essentially). > and the x86 board is just > a slave...How do you tell each ppc board how many devices and what > config (e.g. MACs, etc) to instantiate? Do you assume that they should > all be symmetric and based on positional (e.g. slot) data? What if you > want asymmetric configurations (if not here, perhaps in a different > environment)? > I have no idea, that's for Ira to solve. If he could fake the PCI config space as seen by the x86 board, he would just show the normal pci config and use virtio-pci (multiple channels would show up as a multifunction device). Given he can't, he needs to tunnel the virtio config space some other way. >> Yes. virtio is really virtualization oriented. >> > I would say that it's vhost in particular that is virtualization > oriented. virtio, as a concept, generally should work in physical > systems, if perhaps with some minor modifications. The biggest "limit" > is having "virt" in its name ;) > Let me rephrase. The virtio developers are virtualization oriented. If it works for non-virt applications, that's good, but not a design goal. -- error compiling committee.c: too many arguments to function
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote: > On Tue, Sep 15, 2009 at 05:39:27PM -0400, Gregory Haskins wrote: >> Michael S. Tsirkin wrote: >>> On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote: Michael S. Tsirkin wrote: > On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote: >> No, what I mean is how do you surface multiple ethernet and consoles to >> the guests? For Ira's case, I think he needs at minimum at least one of >> each, and he mentioned possibly having two unique ethernets at one point. >> >> His slave boards surface themselves as PCI devices to the x86 >> host. So how do you use that to make multiple vhost-based devices (say >> two virtio-nets, and a virtio-console) communicate across the transport? >> >> There are multiple ways to do this, but what I am saying is that >> whatever is conceived will start to look eerily like a vbus-connector, >> since this is one of its primary purposes ;) > Can't all this be in userspace? Can you outline your proposal? -Greg >>> Userspace in x86 maps a PCI region, uses it for communication with ppc? >>> >> And what do you propose this communication to look like? > > Who cares? Implement vbus protocol there if you like. > Exactly. My point is that you need something like a vbus protocol there. ;) Here is the protocol I run over PCI in AlacrityVM: http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025 And I guess to your point, yes the protocol can technically be in userspace (outside of whatever you need for the in-kernel portion of the communication transport, if any). The vbus-connector design does not specify where the protocol needs to take place, per se. Note, however, for performance reasons some parts of the protocol may want to be in the kernel (such as DEVCALL and SHMSIGNAL). 
It is for this reason that I just run all of it there, because IMO it's simpler than splitting it up. The slow path stuff just rides on infrastructure that I need for fast-path anyway, so it doesn't really cost me anything additional. Kind Regards, -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tue, Sep 15, 2009 at 05:39:27PM -0400, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote: > >> Michael S. Tsirkin wrote: > >>> On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote: > No, what I mean is how do you surface multiple ethernet and consoles to > the guests? For Ira's case, I think he needs at minimum at least one of > each, and he mentioned possibly having two unique ethernets at one point. > > His slave boards surface themselves as PCI devices to the x86 > host. So how do you use that to make multiple vhost-based devices (say > two virtio-nets, and a virtio-console) communicate across the transport? > > There are multiple ways to do this, but what I am saying is that > whatever is conceived will start to look eerily like a vbus-connector, > since this is one of its primary purposes ;) > >>> Can't all this be in userspace? > >> Can you outline your proposal? > >> > >> -Greg > >> > > > > Userspace in x86 maps a PCI region, uses it for communication with ppc? > > > > And what do you propose this communication to look like? Who cares? Implement vbus protocol there if you like. > -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote: > On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote: >> Michael S. Tsirkin wrote: >>> On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote: No, what I mean is how do you surface multiple ethernet and consoles to the guests? For Ira's case, I think he needs at minimum at least one of each, and he mentioned possibly having two unique ethernets at one point. His slave boards surface themselves as PCI devices to the x86 host. So how do you use that to make multiple vhost-based devices (say two virtio-nets, and a virtio-console) communicate across the transport? There are multiple ways to do this, but what I am saying is that whatever is conceived will start to look eerily like a vbus-connector, since this is one of its primary purposes ;) >>> Can't all this be in userspace? >> Can you outline your proposal? >> >> -Greg >> > > Userspace in x86 maps a PCI region, uses it for communication with ppc? > And what do you propose this communication to look like? -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote: > >> No, what I mean is how do you surface multiple ethernet and consoles to > >> the guests? For Ira's case, I think he needs at minimum at least one of > >> each, and he mentioned possibly having two unique ethernets at one point. > >> > >> His slave boards surface themselves as PCI devices to the x86 > >> host. So how do you use that to make multiple vhost-based devices (say > >> two virtio-nets, and a virtio-console) communicate across the transport? > >> > >> There are multiple ways to do this, but what I am saying is that > >> whatever is conceived will start to look eerily like a vbus-connector, > >> since this is one of its primary purposes ;) > > > > Can't all this be in userspace? > > Can you outline your proposal? > > -Greg > Userspace in x86 maps a PCI region, uses it for communication with ppc?
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote: > On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote: >> No, what I mean is how do you surface multiple ethernet and consoles to >> the guests? For Ira's case, I think he needs at minimum at least one of >> each, and he mentioned possibly having two unique ethernets at one point. >> >> His slave boards surface themselves as PCI devices to the x86 >> host. So how do you use that to make multiple vhost-based devices (say >> two virtio-nets, and a virtio-console) communicate across the transport? >> >> There are multiple ways to do this, but what I am saying is that >> whatever is conceived will start to look eerily like a vbus-connector, >> since this is one of its primary purposes ;) > > Can't all this be in userspace? Can you outline your proposal? -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote: > No, what I mean is how do you surface multiple ethernet and consoles to > the guests? For Ira's case, I think he needs at minimum at least one of > each, and he mentioned possibly having two unique ethernets at one point. > > His slave boards surface themselves as PCI devices to the x86 > host. So how do you use that to make multiple vhost-based devices (say > two virtio-nets, and a virtio-console) communicate across the transport? > > There are multiple ways to do this, but what I am saying is that > whatever is conceived will start to look eerily like a vbus-connector, > since this is one of its primary purposes ;) Can't all this be in userspace?
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/15/2009 04:50 PM, Gregory Haskins wrote: >>> Why? vhost will call get_user_pages() or copy_*_user() which ought to >>> do the right thing. >>> >> I was speaking generally, not specifically to Ira's architecture. What >> I mean is that vbus was designed to work without assuming that the >> memory is pageable. There are environments in which the host is not >> capable of mapping hvas/*page, but the memctx->copy_to/copy_from >> paradigm could still work (think rdma, for instance). >> > > Sure, vbus is more flexible here. > As an aside: a bigger issue is that, iiuc, Ira wants more than a single ethernet channel in his design (multiple ethernets, consoles, etc). A vhost solution in this environment is incomplete. >>> Why? Instantiate as many vhost-nets as needed. >>> >> a) what about non-ethernets? >> > > There's virtio-console, virtio-blk etc. None of these have kernel-mode > servers, but these could be implemented if/when needed. IIUC, Ira already needs at least ethernet and console capability. > >> b) what do you suppose this protocol to aggregate the connections would >> look like? (hint: this is what a vbus-connector does). >> > > You mean multilink? You expose the device as a multiqueue. No, what I mean is how do you surface multiple ethernet and consoles to the guests? For Ira's case, I think he needs at minimum at least one of each, and he mentioned possibly having two unique ethernets at one point. His slave boards surface themselves as PCI devices to the x86 host. So how do you use that to make multiple vhost-based devices (say two virtio-nets, and a virtio-console) communicate across the transport? There are multiple ways to do this, but what I am saying is that whatever is conceived will start to look eerily like a vbus-connector, since this is one of its primary purposes ;) > >> c) how do you manage the configuration, especially on a per-board basis? >> > > pci (for kvm/x86). 
Ok, for kvm understood (and I would also add "qemu" to that mix). But we are talking about vhost's application in a non-kvm environment here, right? So if the vhost-X devices are in the "guest", and the x86 board is just a slave... How do you tell each ppc board how many devices and what config (e.g. MACs, etc) to instantiate? Do you assume that they should all be symmetric and based on positional (e.g. slot) data? What if you want asymmetric configurations (if not here, perhaps in a different environment)? > >> Actually I have patches queued to allow vbus to be managed via ioctls as >> well, per your feedback (and it solves the permissions/lifetime >> criticisms in alacrityvm-v0.1). >> > > That will make qemu integration easier. > >>> The only difference is the implementation. vhost-net >>> leaves much more to userspace, that's the main difference. >>> >> Also, >> >> *) vhost is virtio-net specific, whereas vbus is a more generic device >> model where things like virtio-net or venet ride on top. >> > > I think vhost-net is separated into vhost and vhost-net. That's good. > >> *) vhost is only designed to work with environments that look very >> similar to a KVM guest (slot/hva translatable). vbus can bridge various >> environments by abstracting the key components (such as memory access). >> > > Yes. virtio is really virtualization oriented. I would say that it's vhost in particular that is virtualization oriented. virtio, as a concept, generally should work in physical systems, if perhaps with some minor modifications. The biggest "limit" is having "virt" in its name ;) > >> *) vhost requires an active userspace management daemon, whereas vbus >> can be driven by transient components, like scripts (ala udev) >> > > vhost by design leaves configuration and handshaking to userspace. I > see it as an advantage. The misconception here is that vbus by design _doesn't define_ where configuration/handshaking happens.
It is primarily implemented by a modular component called a "vbus-connector", and _I_ see this flexibility as an advantage. vhost on the other hand depends on an active userspace component and a slots/hva memory design, which is more limiting in where it can be used and forces you to split the logic. However, I think we both more or less agree on this point already. For the record, vbus itself is simply a resource container for virtual-devices, which provides abstractions for the various points of interest to generalizing PV (memory, signals, etc) and the proper isolation and protection guarantees. What you do with it is defined by the modular virtual-devices (e.g. virtio-net, venet, sched, hrt, scsi, rdma, etc) and vbus-connectors (vbus-kvm, etc) you plug into it. As an example, you could emulate the vhost design in vbus by writing a "vbus-vhost" connector. This connector would be very thin and terminate locally in QEMU. It would provide an ioctl-based verb na
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/15/2009 04:50 PM, Gregory Haskins wrote: >> Why? vhost will call get_user_pages() or copy_*_user() which ought to >> do the right thing. >> > I was speaking generally, not specifically to Ira's architecture. What > I mean is that vbus was designed to work without assuming that the > memory is pageable. There are environments in which the host is not > capable of mapping hvas/*page, but the memctx->copy_to/copy_from > paradigm could still work (think rdma, for instance). > Sure, vbus is more flexible here. >>> As an aside: a bigger issue is that, iiuc, Ira wants more than a single >>> ethernet channel in his design (multiple ethernets, consoles, etc). A >>> vhost solution in this environment is incomplete. >>> >>> >> Why? Instantiate as many vhost-nets as needed. >> > a) what about non-ethernets? > There's virtio-console, virtio-blk etc. None of these have kernel-mode servers, but these could be implemented if/when needed. > b) what do you suppose this protocol to aggregate the connections would > look like? (hint: this is what a vbus-connector does). > You mean multilink? You expose the device as a multiqueue. > c) how do you manage the configuration, especially on a per-board basis? > pci (for kvm/x86). > Actually I have patches queued to allow vbus to be managed via ioctls as > well, per your feedback (and it solves the permissions/lifetime > criticisms in alacrityvm-v0.1). > That will make qemu integration easier. >> The only difference is the implementation. vhost-net >> leaves much more to userspace, that's the main difference. >> > Also, > > *) vhost is virtio-net specific, whereas vbus is a more generic device > model where things like virtio-net or venet ride on top. > I think vhost-net is separated into vhost and vhost-net. > *) vhost is only designed to work with environments that look very > similar to a KVM guest (slot/hva translatable). vbus can bridge various > environments by abstracting the key components (such as memory access). > Yes.
virtio is really virtualization oriented. > *) vhost requires an active userspace management daemon, whereas vbus > can be driven by transient components, like scripts (ala udev) > vhost by design leaves configuration and handshaking to userspace. I see it as an advantage. -- error compiling committee.c: too many arguments to function ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tue, Sep 15, 2009 at 09:50:39AM -0400, Gregory Haskins wrote: > Avi Kivity wrote: > > On 09/15/2009 04:03 PM, Gregory Haskins wrote: > >> > >>> In this case the x86 is the owner and the ppc boards use translated > >>> access. Just switch drivers and device and it falls into place. > >>> > >>> > >> You could switch vbus roles as well, I suppose. > > > > Right, there's no real difference in this regard. > > > >> Another potential > >> option is that he can stop mapping host memory on the guest so that it > >> follows the more traditional model. As a bus-master device, the ppc > >> boards should have access to any host memory at least in the GFP_DMA > >> range, which would include all relevant pointers here. > >> > >> I digress: I was primarily addressing the concern that Ira would need > >> to manage the "host" side of the link using hvas mapped from userspace > >> (even if host side is the ppc boards). vbus abstracts that access so as > >> to allow something other than userspace/hva mappings. OTOH, having each > >> ppc board run a userspace app to do the mapping on its behalf and feed > >> it to vhost is probably not a huge deal either. Where vhost might > >> really fall apart is when any assumptions about pageable memory occur, > >> if any. > >> > > > > Why? vhost will call get_user_pages() or copy_*_user() which ought to > > do the right thing. > > I was speaking generally, not specifically to Ira's architecture. What > I mean is that vbus was designed to work without assuming that the > memory is pageable. There are environments in which the host is not > capable of mapping hvas/*page, but the memctx->copy_to/copy_from > paradigm could still work (think rdma, for instance). rdma interfaces are typically asynchronous, so blocking copy_from/copy_to can be made to work, but likely won't work that well. DMA might work better if it is asynchronous as well.
Assuming a synchronous copy is what we need - maybe the issue is that there aren't good APIs for x86/ppc communication? If so, sticking them in vhost might not be the best place. Maybe the specific platform can redefine copy_to/from_user to do the right thing? Or, maybe add another API for that ... > > > >> As an aside: a bigger issue is that, iiuc, Ira wants more than a single > >> ethernet channel in his design (multiple ethernets, consoles, etc). A > >> vhost solution in this environment is incomplete. > >> > > > > Why? Instantiate as many vhost-nets as needed. > > a) what about non-ethernets? vhost-net actually does not care. The packet is passed on to a socket, and we are done. > b) what do you suppose this protocol to aggregate the connections would > look like? (hint: this is what a vbus-connector does). You are talking about a management protocol between ppc and x86, right? One wonders why it has to be in the kernel at all. > c) how do you manage the configuration, especially on a per-board basis? Not sure what a board is, but configuration is done in userspace. -- MST
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/15/2009 04:03 PM, Gregory Haskins wrote: >> >>> In this case the x86 is the owner and the ppc boards use translated >>> access. Just switch drivers and device and it falls into place. >>> >>> >> You could switch vbus roles as well, I suppose. > > Right, there's no real difference in this regard. > >> Another potential >> option is that he can stop mapping host memory on the guest so that it >> follows the more traditional model. As a bus-master device, the ppc >> boards should have access to any host memory at least in the GFP_DMA >> range, which would include all relevant pointers here. >> >> I digress: I was primarily addressing the concern that Ira would need >> to manage the "host" side of the link using hvas mapped from userspace >> (even if host side is the ppc boards). vbus abstracts that access so as >> to allow something other than userspace/hva mappings. OTOH, having each >> ppc board run a userspace app to do the mapping on its behalf and feed >> it to vhost is probably not a huge deal either. Where vhost might >> really fall apart is when any assumptions about pageable memory occur, >> if any. >> > > Why? vhost will call get_user_pages() or copy_*_user() which ought to > do the right thing. I was speaking generally, not specifically to Ira's architecture. What I mean is that vbus was designed to work without assuming that the memory is pageable. There are environments in which the host is not capable of mapping hvas/*page, but the memctx->copy_to/copy_from paradigm could still work (think rdma, for instance). > >> As an aside: a bigger issue is that, iiuc, Ira wants more than a single >> ethernet channel in his design (multiple ethernets, consoles, etc). A >> vhost solution in this environment is incomplete. >> > > Why? Instantiate as many vhost-nets as needed. a) what about non-ethernets? b) what do you suppose this protocol to aggregate the connections would look like? (hint: this is what a vbus-connector does).
c) how do you manage the configuration, especially on a per-board basis? > >> Note that Ira's architecture highlights that vbus's explicit management >> interface is more valuable here than it is in KVM, since KVM already has >> its own management interface via QEMU. >> > > vhost-net and vbus both need management, vhost-net via ioctls and vbus > via configfs. Actually I have patches queued to allow vbus to be managed via ioctls as well, per your feedback (and it solves the permissions/lifetime criticisms in alacrityvm-v0.1). > The only difference is the implementation. vhost-net > leaves much more to userspace, that's the main difference. Also, *) vhost is virtio-net specific, whereas vbus is a more generic device model where things like virtio-net or venet ride on top. *) vhost is only designed to work with environments that look very similar to a KVM guest (slot/hva translatable). vbus can bridge various environments by abstracting the key components (such as memory access). *) vhost requires an active userspace management daemon, whereas vbus can be driven by transient components, like scripts (ala udev) Kind Regards, -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/15/2009 04:03 PM, Gregory Haskins wrote: > >> In this case the x86 is the owner and the ppc boards use translated >> access. Just switch drivers and device and it falls into place. >> >> > You could switch vbus roles as well, I suppose. Right, there's no real difference in this regard. > Another potential > option is that he can stop mapping host memory on the guest so that it > follows the more traditional model. As a bus-master device, the ppc > boards should have access to any host memory at least in the GFP_DMA > range, which would include all relevant pointers here. > > I digress: I was primarily addressing the concern that Ira would need > to manage the "host" side of the link using hvas mapped from userspace > (even if host side is the ppc boards). vbus abstracts that access so as > to allow something other than userspace/hva mappings. OTOH, having each > ppc board run a userspace app to do the mapping on its behalf and feed > it to vhost is probably not a huge deal either. Where vhost might > really fall apart is when any assumptions about pageable memory occur, > if any. > Why? vhost will call get_user_pages() or copy_*_user() which ought to do the right thing. > As an aside: a bigger issue is that, iiuc, Ira wants more than a single > ethernet channel in his design (multiple ethernets, consoles, etc). A > vhost solution in this environment is incomplete. > Why? Instantiate as many vhost-nets as needed. > Note that Ira's architecture highlights that vbus's explicit management > interface is more valuable here than it is in KVM, since KVM already has > its own management interface via QEMU. > vhost-net and vbus both need management, vhost-net via ioctls and vbus via configfs. The only difference is the implementation. vhost-net leaves much more to userspace, that's the main difference.
-- error compiling committee.c: too many arguments to function
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: > On 09/14/2009 10:14 PM, Gregory Haskins wrote: >> To reiterate, as long as the model is such that the ppc boards are >> considered the "owner" (direct access, no translation needed) I believe >> it will work. If the pointers are expected to be owned by the host, >> then my model doesn't work well either. >> > > In this case the x86 is the owner and the ppc boards use translated > access. Just switch drivers and device and it falls into place. > You could switch vbus roles as well, I suppose. Another potential option is that he can stop mapping host memory on the guest so that it follows the more traditional model. As a bus-master device, the ppc boards should have access to any host memory at least in the GFP_DMA range, which would include all relevant pointers here. I digress: I was primarily addressing the concern that Ira would need to manage the "host" side of the link using hvas mapped from userspace (even if host side is the ppc boards). vbus abstracts that access so as to allow something other than userspace/hva mappings. OTOH, having each ppc board run a userspace app to do the mapping on its behalf and feed it to vhost is probably not a huge deal either. Where vhost might really fall apart is when any assumptions about pageable memory occur, if any. As an aside: a bigger issue is that, iiuc, Ira wants more than a single ethernet channel in his design (multiple ethernets, consoles, etc). A vhost solution in this environment is incomplete. Note that Ira's architecture highlights that vbus's explicit management interface is more valuable here than it is in KVM, since KVM already has its own management interface via QEMU. Kind Regards, -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/14/2009 10:14 PM, Gregory Haskins wrote: > To reiterate, as long as the model is such that the ppc boards are > considered the "owner" (direct access, no translation needed) I believe > it will work. If the pointers are expected to be owned by the host, > then my model doesn't work well either. > In this case the x86 is the owner and the ppc boards use translated access. Just switch drivers and device and it falls into place. -- error compiling committee.c: too many arguments to function
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/14/2009 07:47 PM, Michael S. Tsirkin wrote: > On Mon, Sep 14, 2009 at 12:08:55PM -0400, Gregory Haskins wrote: > >> For Ira's example, the addresses would represent a physical address on >> the PCI boards, and would follow any kind of relevant rules for >> converting a "GPA" to a host accessible address (even if indirectly, via >> a dma controller). >> > I don't think limiting addresses to PCI physical addresses will work > well. From what I remember, Ira's x86 can not initiate burst > transactions on PCI, and it's the ppc that initiates all DMA. > vhost-net would run on the PPC then. >>> But we can't let the guest specify physical addresses. >>> >> Agreed. Neither your proposal nor mine operate this way afaict. >> > But this seems to be what Ira needs. > In Ira's scenario, the "guest" (x86 host) specifies x86 physical addresses, and the ppc dmas to them. It's the virtio model without any change. A normal guest also specifies physical addresses. -- error compiling committee.c: too many arguments to function
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote: > On Mon, Sep 14, 2009 at 12:08:55PM -0400, Gregory Haskins wrote: >> Michael S. Tsirkin wrote: >>> On Fri, Sep 11, 2009 at 12:00:21PM -0400, Gregory Haskins wrote: FWIW: VBUS handles this situation via the "memctx" abstraction. IOW, the memory is not assumed to be a userspace address. Rather, it is a memctx-specific address, which can be userspace, or any other type (including hardware, dma-engine, etc). As long as the memctx knows how to translate it, it will work. >>> How would permissions be handled? >> Same as anything else, really. Read on for details. >> >>> it's easy to allow an app to pass in virtual addresses in its own address >>> space. >> Agreed, and this is what I do. >> >> The guest always passes its own physical addresses (using things like >> __pa() in linux). This address passed is memctx specific, but generally >> would fall into the category of "virtual-addresses" from the hosts >> perspective. >> >> For a KVM/AlacrityVM guest example, the addresses are GPAs, accessed >> internally to the context via a gfn_to_hva conversion (you can see this >> occurring in the citation links I sent) >> >> For Ira's example, the addresses would represent a physical address on >> the PCI boards, and would follow any kind of relevant rules for >> converting a "GPA" to a host accessible address (even if indirectly, via >> a dma controller). > > So vbus can let an application "application" means KVM guest, or ppc board, right? > access either its own virtual memory or a physical memory on a PCI device. To reiterate from the last reply: the model is the "guest" owns the memory. The host is granted access to that memory by means of a memctx object, which must be admitted to the host kernel and accessed according to standard access-policy mechanisms. Generally the "application" or guest would never be accessing anything other than its own memory.
> My question is, is any application > that's allowed to do the former also granted rights to do the latter? If I understand your question, no. Can you elaborate? Kind Regards, -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote: > On Mon, Sep 14, 2009 at 12:08:55PM -0400, Gregory Haskins wrote: >> For Ira's example, the addresses would represent a physical address on >> the PCI boards, and would follow any kind of relevant rules for >> converting a "GPA" to a host accessible address (even if indirectly, via >> a dma controller). > > I don't think limiting addresses to PCI physical addresses will work > well. The only "limit" is imposed by the memctx. If a given context needs to meet certain requirements beyond PCI physical addresses, it would presumably be designed that way. > From what I remember, Ira's x86 can not initiate burst > transactions on PCI, and it's the ppc that initiates all DMA. The only requirement is that the "guest" "owns" the memory. IOW: As with virtio/vhost, the guest can access the pointers in the ring directly but the host must pass through a translation function. Your translation is direct: you use a slots/hva scheme. My translation is abstracted, which means it can support slots/hva (such as in alacrityvm) or some other scheme as long as the general model of "guest owned" holds true. > >>> But we can't let the guest specify physical addresses. >> Agreed. Neither your proposal nor mine operate this way afaict. > > But this seems to be what Ira needs. So what he could do then is implement the memctx to integrate with the ppc side dma controller. E.g. "translation" in his box means a protocol from the x86 to the ppc to initiate the dma cycle. This could be exposed as a dma facility in the register file of the ppc boards, for instance. To reiterate, as long as the model is such that the ppc boards are considered the "owner" (direct access, no translation needed) I believe it will work. If the pointers are expected to be owned by the host, then my model doesn't work well either.
Kind Regards, -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Mon, Sep 14, 2009 at 12:08:55PM -0400, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Fri, Sep 11, 2009 at 12:00:21PM -0400, Gregory Haskins wrote: > >> FWIW: VBUS handles this situation via the "memctx" abstraction. IOW, > >> the memory is not assumed to be a userspace address. Rather, it is a > >> memctx-specific address, which can be userspace, or any other type > >> (including hardware, dma-engine, etc). As long as the memctx knows how > >> to translate it, it will work. > > > > How would permissions be handled? > > Same as anything else, really. Read on for details. > > > it's easy to allow an app to pass in virtual addresses in its own address > > space. > > Agreed, and this is what I do. > > The guest always passes its own physical addresses (using things like > __pa() in linux). This address passed is memctx specific, but generally > would fall into the category of "virtual-addresses" from the hosts > perspective. > > For a KVM/AlacrityVM guest example, the addresses are GPAs, accessed > internally to the context via a gfn_to_hva conversion (you can see this > occurring in the citation links I sent) > > For Ira's example, the addresses would represent a physical address on > the PCI boards, and would follow any kind of relevant rules for > converting a "GPA" to a host accessible address (even if indirectly, via > a dma controller). So vbus can let an application access either its own virtual memory or a physical memory on a PCI device. My question is, is any application that's allowed to do the former also granted rights to do the latter? > > But we can't let the guest specify physical addresses. > > Agreed. Neither your proposal nor mine operate this way afaict. > > HTH > > Kind Regards, > -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Mon, Sep 14, 2009 at 12:08:55PM -0400, Gregory Haskins wrote: > For Ira's example, the addresses would represent a physical address on > the PCI boards, and would follow any kind of relevant rules for > converting a "GPA" to a host accessible address (even if indirectly, via > a dma controller). I don't think limiting addresses to PCI physical addresses will work well. From what I remember, Ira's x86 can not initiate burst transactions on PCI, and it's the ppc that initiates all DMA. > > > But we can't let the guest specify physical addresses. > > Agreed. Neither your proposal nor mine operate this way afaict. But this seems to be what Ira needs. > HTH > > Kind Regards, > -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote: > On Fri, Sep 11, 2009 at 12:00:21PM -0400, Gregory Haskins wrote: >> FWIW: VBUS handles this situation via the "memctx" abstraction. IOW, >> the memory is not assumed to be a userspace address. Rather, it is a >> memctx-specific address, which can be userspace, or any other type >> (including hardware, dma-engine, etc). As long as the memctx knows how >> to translate it, it will work. > > How would permissions be handled? Same as anything else, really. Read on for details. > it's easy to allow an app to pass in virtual addresses in its own address > space. Agreed, and this is what I do. The guest always passes its own physical addresses (using things like __pa() in linux). This address passed is memctx specific, but generally would fall into the category of "virtual-addresses" from the hosts perspective. For a KVM/AlacrityVM guest example, the addresses are GPAs, accessed internally to the context via a gfn_to_hva conversion (you can see this occurring in the citation links I sent) For Ira's example, the addresses would represent a physical address on the PCI boards, and would follow any kind of relevant rules for converting a "GPA" to a host accessible address (even if indirectly, via a dma controller). > But we can't let the guest specify physical addresses. Agreed. Neither your proposal nor mine operate this way afaict. HTH Kind Regards, -Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Mon, Sep 14, 2009 at 01:57:06PM +0800, Xin, Xiaohui wrote: > >The irqfd/ioeventfd patches are part of Avi's kvm.git tree: > >git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git > > > >I expect them to be merged by 2.6.32-rc1 - right, Avi? > > Michael, > > I think I have the kernel patch for kvm_irqfd and kvm_ioeventfd, but missed > the qemu side patch for irqfd and ioeventfd. > > I met the compile error when I compiled virtio-pci.c file in qemu-kvm like > this: > > /root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:384: error: `KVM_IRQFD` > undeclared (first use in this function) > /root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:400: error: `KVM_IOEVENTFD` > undeclared (first use in this function) > > Which qemu tree or patch do you use for kvm_irqfd and kvm_ioeventfd? I'm using the headers from upstream kernel. I'll send a patch for that. > Thanks > Xiaohui > > -Original Message- > From: Michael S. Tsirkin [mailto:m...@redhat.com] > Sent: Sunday, September 13, 2009 1:46 PM > To: Xin, Xiaohui > Cc: Ira W. Snyder; net...@vger.kernel.org; > virtualization@lists.linux-foundation.org; k...@vger.kernel.org; > linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; > a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com; Rusty > Russell; s.he...@linux-ag.com; a...@redhat.com > Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server > > On Fri, Sep 11, 2009 at 11:17:33PM +0800, Xin, Xiaohui wrote: > > Michael, > > We are very interested in your patch and want to have a try with it. > > I have collected your 3 patches in kernel side and 4 patches in qemu side.
> > The patches are listed here: > > > > PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch > > PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch > > PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch > > > > PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch > > PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch > > PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch > > PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch > > > > I applied the kernel patches on v2.6.31-rc4 and the qemu patches on latest > > kvm qemu. > > But it seems some patches are needed, at least the irqfd and ioeventfd patches, on > > current qemu. I cannot create a kvm guest with "-net > > nic,model=virtio,vhost=vethX". > > > > Could you kindly advise us of the exact patch list to make it work? > > Thanks a lot. :-) > > > > Thanks > > Xiaohui > > > The irqfd/ioeventfd patches are part of Avi's kvm.git tree: > git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git > > I expect them to be merged by 2.6.32-rc1 - right, Avi? > > -- > MST
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
>The irqfd/ioeventfd patches are part of Avi's kvm.git tree: >git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git > >I expect them to be merged by 2.6.32-rc1 - right, Avi? Michael, I think I have the kernel patch for kvm_irqfd and kvm_ioeventfd, but missed the qemu side patch for irqfd and ioeventfd. I met the compile error when I compiled virtio-pci.c file in qemu-kvm like this: /root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:384: error: `KVM_IRQFD` undeclared (first use in this function) /root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:400: error: `KVM_IOEVENTFD` undeclared (first use in this function) Which qemu tree or patch do you use for kvm_irqfd and kvm_ioeventfd? Thanks Xiaohui -Original Message- From: Michael S. Tsirkin [mailto:m...@redhat.com] Sent: Sunday, September 13, 2009 1:46 PM To: Xin, Xiaohui Cc: Ira W. Snyder; net...@vger.kernel.org; virtualization@lists.linux-foundation.org; k...@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com; Rusty Russell; s.he...@linux-ag.com; a...@redhat.com Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server On Fri, Sep 11, 2009 at 11:17:33PM +0800, Xin, Xiaohui wrote: > Michael, > We are very interested in your patch and want to have a try with it. > I have collected your 3 patches in kernel side and 4 patches in queue side. > The patches are listed here: > > PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch > PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch > PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch > > PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch > PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch > PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch > PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch > > I applied the kernel patches on v2.6.31-rc4 and the qemu patches on latest > kvm qemu. 
> But it seems some patches are needed, at least the irqfd and ioeventfd patches, on > current qemu. I cannot create a kvm guest with "-net > nic,model=virtio,vhost=vethX". > > Could you kindly advise us of the exact patch list to make it work? > Thanks a lot. :-) > > Thanks > Xiaohui The irqfd/ioeventfd patches are part of Avi's kvm.git tree: git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git I expect them to be merged by 2.6.32-rc1 - right, Avi? -- MST
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Fri, Sep 11, 2009 at 12:00:21PM -0400, Gregory Haskins wrote: > FWIW: VBUS handles this situation via the "memctx" abstraction. IOW, > the memory is not assumed to be a userspace address. Rather, it is a > memctx-specific address, which can be userspace, or any other type > (including hardware, dma-engine, etc). As long as the memctx knows how > to translate it, it will work. How would permissions be handled? It's easy to allow an app to pass in virtual addresses in its own address space. But we can't let the guest specify physical addresses. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Fri, Sep 11, 2009 at 11:17:33PM +0800, Xin, Xiaohui wrote: > Michael, > We are very interested in your patch and want to have a try with it. > I have collected your 3 patches in kernel side and 4 patches in queue side. > The patches are listed here: > > PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch > PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch > PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch > > PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch > PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch > PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch > PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch > > I applied the kernel patches on v2.6.31-rc4 and the qemu patches on latest > kvm qemu. > But seems there are some patches are needed at least irqfd and ioeventfd > patches on > current qemu. I cannot create a kvm guest with "-net > nic,model=virtio,vhost=vethX". > > May you kindly advice us the patch lists all exactly to make it work? > Thanks a lot. :-) > > Thanks > Xiaohui The irqfd/ioeventfd patches are part of Avi's kvm.git tree: git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git I expect them to be merged by 2.6.32-rc1 - right, Avi? -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Ira W. Snyder wrote: > On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote: >> On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote: >>> On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote: What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. There's similarity with vringfd, with some differences and reduced scope - uses eventfd for signalling - structures can be moved around in memory at any time (good for migration) - support memory table and not just an offset (needed for kvm) common virtio related code has been put in a separate file vhost.c and can be made into a separate module if/when more backends appear. I used Rusty's lguest.c as the source for developing this part : this supplied me with witty comments I wouldn't be able to write myself. What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how guest performs hypercalls. Userspace hypervisors are supported as well as kvm. How it works: Basically, we connect virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tun-like device. In this version I only support raw socket as a backend, which can be bound to e.g. SR IOV, or to macvlan device. Backend is also configured by userspace, including vlan/mac etc. Status: This works for me, and I haven't see any crashes. I have done some light benchmarking (with v4), compared to userspace, I see improved latency (as I save up to 4 system calls per packet) but not bandwidth/CPU (as TSO and interrupt mitigation are not supported). For ping benchmark (where there's no TSO) troughput is also improved. 
Features that I plan to look at in the future: - tap support - TSO - interrupt mitigation - zero copy >>> Hello Michael, >>> >>> I've started looking at vhost with the intention of using it over PCI to >>> connect physical machines together. >>> >>> The part that I am struggling with the most is figuring out which parts >>> of the rings are in the host's memory, and which parts are in the >>> guest's memory. >> All rings are in guest's memory, to match existing virtio code. > > Ok, this makes sense. > >> vhost >> assumes that the memory space of the hypervisor userspace process covers >> the whole of guest memory. > > Is this necessary? Why? The assumption seems very wrong when you're > doing data transport between two physical systems via PCI. FWIW: VBUS handles this situation via the "memctx" abstraction. IOW, the memory is not assumed to be a userspace address. Rather, it is a memctx-specific address, which can be userspace, or any other type (including hardware, dma-engine, etc). As long as the memctx knows how to translate it, it will work. Kind Regards, -Greg signature.asc Description: OpenPGP digital signature ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Gregory Haskins wrote: [snip] > > FWIW: VBUS handles this situation via the "memctx" abstraction. IOW, > the memory is not assumed to be a userspace address. Rather, it is a > memctx-specific address, which can be userspace, or any other type > (including hardware, dma-engine, etc). As long as the memctx knows how > to translate it, it will work. > citations: Here is a packet import (from the perspective of the host side "venet" device model, similar to Michaels "vhost") http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/devices/venet-tap.c;h=ee091c47f06e9bb8487a45e72d493273fe08329f;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l535 Here is the KVM specific memctx: http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/kvm.c;h=56e2c5682a7ca8432c159377b0f7389cf34cbc1b;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l188 and http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=virt/kvm/xinterface.c;h=0cccb6095ca2a51bad01f7ba2137fdd9111b63d3;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l289 You could alternatively define a memctx for your environment which knows how to deal with your PPC boards PCI based memory, and the devices would all "just work". Kind Regards, -Greg signature.asc Description: OpenPGP digital signature ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
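For readers who do not want to chase the git links above, the "memctx" idea — memory access routed through an ops table so the backend never assumes addresses are plain userspace pointers — can be sketched in a few lines of C. This is a rough userspace illustration; the struct and function names are invented here and are not AlacrityVM's actual API:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the memctx idea: all memory access goes through an ops
 * table, so a backend works the same whether addresses resolve to a
 * userspace mapping, guest memory, or a DMA-capable PCI window. */
struct memctx;

struct memctx_ops {
    /* translate-and-copy in each direction; a KVM memctx would resolve
     * guest addresses, a PCI one could program a DMA engine instead */
    int (*copy_to)(struct memctx *ctx, unsigned long dst,
                   const void *src, size_t len);
    int (*copy_from)(struct memctx *ctx, void *dst,
                     unsigned long src, size_t len);
};

struct memctx {
    const struct memctx_ops *ops;
    void *priv;                 /* backend-specific state */
};

/* Trivial "identity" memctx: addresses are offsets into a local buffer,
 * standing in for a real guest-memory translation. */
static int ident_copy_to(struct memctx *ctx, unsigned long dst,
                         const void *src, size_t len)
{
    memcpy((char *)ctx->priv + dst, src, len);
    return 0;
}

static int ident_copy_from(struct memctx *ctx, void *dst,
                           unsigned long src, size_t len)
{
    memcpy(dst, (char *)ctx->priv + src, len);
    return 0;
}

static const struct memctx_ops ident_ops = {
    .copy_to   = ident_copy_to,
    .copy_from = ident_copy_from,
};
```

A PCI-backed board would supply its own ops that use ioread/iowrite or a DMA engine, which is the point Greg is making about the PPC case.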
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael, We are very interested in your patch and want to have a try with it. I have collected your 3 patches on the kernel side and 4 patches on the qemu side. The patches are listed here:

PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch
PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch
PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch

PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch
PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch
PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch
PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch

I applied the kernel patches on v2.6.31-rc4 and the qemu patches on the latest kvm qemu. But it seems some further patches are needed, at least the irqfd and ioeventfd patches, on current qemu. I cannot create a kvm guest with "-net nic,model=virtio,vhost=vethX". Could you kindly advise us of the exact patch list needed to make it work? Thanks a lot. :-)

Thanks
Xiaohui

-Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Michael S. Tsirkin Sent: Wednesday, September 09, 2009 4:14 AM To: Ira W. Snyder Cc: net...@vger.kernel.org; virtualization@lists.linux-foundation.org; k...@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com; Rusty Russell; s.he...@linux-ag.com Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server On Tue, Sep 08, 2009 at 10:20:35AM -0700, Ira W. Snyder wrote: > On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote: > > On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote: > > > On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote: > > > > What it is: vhost net is a character device that can be used to reduce > > > > the number of system calls involved in virtio networking. > > > > Existing virtio net code is used in the guest without modification.
> > > > > > > > There's similarity with vringfd, with some differences and reduced scope > > > > - uses eventfd for signalling > > > > - structures can be moved around in memory at any time (good for > > > > migration) > > > > - support memory table and not just an offset (needed for kvm) > > > > > > > > common virtio related code has been put in a separate file vhost.c and > > > > can be made into a separate module if/when more backends appear. I used > > > > Rusty's lguest.c as the source for developing this part : this supplied > > > > me with witty comments I wouldn't be able to write myself. > > > > > > > > What it is not: vhost net is not a bus, and not a generic new system > > > > call. No assumptions are made on how guest performs hypercalls. > > > > Userspace hypervisors are supported as well as kvm. > > > > > > > > How it works: Basically, we connect virtio frontend (configured by > > > > userspace) to a backend. The backend could be a network device, or a > > > > tun-like device. In this version I only support raw socket as a backend, > > > > which can be bound to e.g. SR IOV, or to macvlan device. Backend is > > > > also configured by userspace, including vlan/mac etc. > > > > > > > > Status: > > > > This works for me, and I haven't see any crashes. > > > > I have done some light benchmarking (with v4), compared to userspace, I > > > > see improved latency (as I save up to 4 system calls per packet) but not > > > > bandwidth/CPU (as TSO and interrupt mitigation are not supported). For > > > > ping benchmark (where there's no TSO) troughput is also improved. > > > > > > > > Features that I plan to look at in the future: > > > > - tap support > > > > - TSO > > > > - interrupt mitigation > > > > - zero copy > > > > > > > > > > Hello Michael, > > > > > > I've started looking at vhost with the intention of using it over PCI to > > > connect physical machines together. 
> > > > > > The part that I am struggling with the most is figuring out which parts > > > of the rings are in the host's memory, and which parts are in the > > > guest's memory. > > > > All rings are in guest's memory, to match existing virtio code. > > Ok, this makes sense. > > > vhost > > assumes that the memory space of the hypervisor userspace process covers > > the whole of guest memory. > > Is this necessary? Why? Because with virtio ring can give
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tue, Sep 08, 2009 at 10:20:35AM -0700, Ira W. Snyder wrote: > On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote: > > On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote: > > > On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote: > > > > What it is: vhost net is a character device that can be used to reduce > > > > the number of system calls involved in virtio networking. > > > > Existing virtio net code is used in the guest without modification. > > > > > > > > There's similarity with vringfd, with some differences and reduced scope > > > > - uses eventfd for signalling > > > > - structures can be moved around in memory at any time (good for > > > > migration) > > > > - support memory table and not just an offset (needed for kvm) > > > > > > > > common virtio related code has been put in a separate file vhost.c and > > > > can be made into a separate module if/when more backends appear. I used > > > > Rusty's lguest.c as the source for developing this part : this supplied > > > > me with witty comments I wouldn't be able to write myself. > > > > > > > > What it is not: vhost net is not a bus, and not a generic new system > > > > call. No assumptions are made on how guest performs hypercalls. > > > > Userspace hypervisors are supported as well as kvm. > > > > > > > > How it works: Basically, we connect virtio frontend (configured by > > > > userspace) to a backend. The backend could be a network device, or a > > > > tun-like device. In this version I only support raw socket as a backend, > > > > which can be bound to e.g. SR IOV, or to macvlan device. Backend is > > > > also configured by userspace, including vlan/mac etc. > > > > > > > > Status: > > > > This works for me, and I haven't see any crashes. 
> > > > I have done some light benchmarking (with v4), compared to userspace, I > > > > see improved latency (as I save up to 4 system calls per packet) but not > > > > bandwidth/CPU (as TSO and interrupt mitigation are not supported). For > > > > ping benchmark (where there's no TSO) throughput is also improved. > > > > > > > > Features that I plan to look at in the future: > > > > - tap support > > > > - TSO > > > > - interrupt mitigation > > > > - zero copy > > > > > > > > > > Hello Michael, > > > > > > I've started looking at vhost with the intention of using it over PCI to > > > connect physical machines together. > > > > > > The part that I am struggling with the most is figuring out which parts > > > of the rings are in the host's memory, and which parts are in the > > > guest's memory. > > > > All rings are in guest's memory, to match existing virtio code. > > Ok, this makes sense. > > > vhost > > assumes that the memory space of the hypervisor userspace process covers > > the whole of guest memory. > > Is this necessary? Why? Because with virtio, the ring can give us arbitrary guest addresses. If the guest were limited to using a subset of addresses, the hypervisor would only have to map those. > The assumption seems very wrong when you're > doing data transport between two physical systems via PCI. > I know vhost has not been designed for this specific situation, but it > is good to be looking toward other possible uses. > > > And there's a translation table. > > Ring addresses are userspace addresses, they do not undergo translation. > > > > > If I understand everything correctly, the rings are all userspace > > > addresses, which means that they can be moved around in physical memory, > > > and get pushed out to swap. > > > > Unless they are locked, yes. > > > > > AFAIK, this is impossible to handle when > > > connecting two physical systems, you'd need the rings available in IO > > > memory (PCI memory), so you can ioreadXX() them instead.
To the best of > > > my knowledge, I shouldn't be using copy_to_user() on an __iomem address. > > > Also, having them migrate around in memory would be a bad thing. > > > > > > Also, I'm having trouble figuring out how the packet contents are > > > actually copied from one system to the other. Could you point this out > > > for me? > > > > The code in net/packet/af_packet.c does it when vhost calls sendmsg. > > > > Ok. The sendmsg() implementation uses memcpy_fromiovec(). Is it possible > to make this use a DMA engine instead? Maybe. > I know this was suggested in an earlier thread. Yes, it might even give some performance benefit with e.g. I/O AT. > > > Is there somewhere I can find the userspace code (kvm, qemu, lguest, > > > etc.) code needed for interacting with the vhost misc device so I can > > > get a better idea of how userspace is supposed to work? > > > > Look in archives for k...@vger.kernel.org. the subject is qemu-kvm: vhost > > net. > > > > > (Features > > > negotiation, etc.) > > > > > > > That's not yet implemented as there are no features yet. I'm working on > > tap support, which will add a feature bit. Overall, qemu does an ioctl > > to query supported features, and then a
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote: > On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote: > > On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote: > > > What it is: vhost net is a character device that can be used to reduce > > > the number of system calls involved in virtio networking. > > > Existing virtio net code is used in the guest without modification. > > > > > > There's similarity with vringfd, with some differences and reduced scope > > > - uses eventfd for signalling > > > - structures can be moved around in memory at any time (good for > > > migration) > > > - support memory table and not just an offset (needed for kvm) > > > > > > common virtio related code has been put in a separate file vhost.c and > > > can be made into a separate module if/when more backends appear. I used > > > Rusty's lguest.c as the source for developing this part : this supplied > > > me with witty comments I wouldn't be able to write myself. > > > > > > What it is not: vhost net is not a bus, and not a generic new system > > > call. No assumptions are made on how guest performs hypercalls. > > > Userspace hypervisors are supported as well as kvm. > > > > > > How it works: Basically, we connect virtio frontend (configured by > > > userspace) to a backend. The backend could be a network device, or a > > > tun-like device. In this version I only support raw socket as a backend, > > > which can be bound to e.g. SR IOV, or to macvlan device. Backend is > > > also configured by userspace, including vlan/mac etc. > > > > > > Status: > > > This works for me, and I haven't see any crashes. > > > I have done some light benchmarking (with v4), compared to userspace, I > > > see improved latency (as I save up to 4 system calls per packet) but not > > > bandwidth/CPU (as TSO and interrupt mitigation are not supported). For > > > ping benchmark (where there's no TSO) troughput is also improved. 
> > > > > > Features that I plan to look at in the future: > > > - tap support > > > - TSO > > > - interrupt mitigation > > > - zero copy > > > > > > > Hello Michael, > > > > I've started looking at vhost with the intention of using it over PCI to > > connect physical machines together. > > > > The part that I am struggling with the most is figuring out which parts > > of the rings are in the host's memory, and which parts are in the > > guest's memory. > > All rings are in guest's memory, to match existing virtio code. Ok, this makes sense. > vhost > assumes that the memory space of the hypervisor userspace process covers > the whole of guest memory. Is this necessary? Why? The assumption seems very wrong when you're doing data transport between two physical systems via PCI. I know vhost has not been designed for this specific situation, but it is good to be looking toward other possible uses. > And there's a translation table. > Ring addresses are userspace addresses, they do not undergo translation. > > > If I understand everything correctly, the rings are all userspace > > addresses, which means that they can be moved around in physical memory, > > and get pushed out to swap. > > Unless they are locked, yes. > > > AFAIK, this is impossible to handle when > > connecting two physical systems, you'd need the rings available in IO > > memory (PCI memory), so you can ioreadXX() them instead. To the best of > > my knowledge, I shouldn't be using copy_to_user() on an __iomem address. > > Also, having them migrate around in memory would be a bad thing. > > > > Also, I'm having trouble figuring out how the packet contents are > > actually copied from one system to the other. Could you point this out > > for me? > > The code in net/packet/af_packet.c does it when vhost calls sendmsg. > Ok. The sendmsg() implementation uses memcpy_fromiovec(). Is it possible to make this use a DMA engine instead? I know this was suggested in an earlier thread. 
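For readers following the memcpy_fromiovec() exchange above, here is a hedged userspace re-creation of the gather step: collect bytes from a scatter list into one flat buffer. The function name `gather_iovec` is invented for illustration; the kernel helper differs in details, and a DMA-engine variant would replace the memcpy with a programmed transfer:

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Gather up to `len` bytes from a scatter list (struct iovec, as used
 * by sendmsg) into one contiguous buffer. Returns bytes copied, which
 * may be less than `len` if the iovecs run out first. */
static size_t gather_iovec(void *dst, const struct iovec *iov, int iovcnt,
                           size_t len)
{
    size_t done = 0;
    for (int i = 0; i < iovcnt && done < len; i++) {
        size_t n = iov[i].iov_len;
        if (n > len - done)
            n = len - done;                    /* clamp final segment */
        memcpy((char *)dst + done, iov[i].iov_base, n);
        done += n;
    }
    return done;
}
```

The I/O AT suggestion in the thread amounts to handing each (iov_base, n) pair to a DMA channel instead of the CPU memcpy.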
> > Is there somewhere I can find the userspace code (kvm, qemu, lguest, > > etc.) code needed for interacting with the vhost misc device so I can > > get a better idea of how userspace is supposed to work? > > Look in archives for k...@vger.kernel.org. the subject is qemu-kvm: vhost net. > > > (Features > > negotiation, etc.) > > > > That's not yet implemented as there are no features yet. I'm working on > tap support, which will add a feature bit. Overall, qemu does an ioctl > to query supported features, and then acks them with another ioctl. I'm > also trying to avoid duplicating functionality available elsewhere. So > that to check e.g. TSO support, you'd just look at the underlying > hardware device you are binding to. > Ok. Do you have plans to support the VIRTIO_NET_F_MRG_RXBUF feature in the future? I found that this made an enormous improvement in throughput on my virtio-net <-> virtio-net system. Perhaps it isn't needed with vhost-net. Thanks for replying, Ira ___
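The query-then-ack feature negotiation Michael describes ("qemu does an ioctl to query supported features, and then acks them with another ioctl") reduces to bitmask arithmetic. A minimal sketch, with placeholder feature bits and a placeholder error convention rather than the real vhost ioctl interface:

```c
#include <stdint.h>

/* Placeholder feature bits; the real virtio/vhost bit assignments
 * differ and live in the respective headers. */
#define FEAT_TAP        (1ULL << 0)
#define FEAT_MRG_RXBUF  (1ULL << 1)

/* Step 1: the backend reports what it supports (the "query" ioctl). */
static uint64_t host_features(void)
{
    return FEAT_TAP | FEAT_MRG_RXBUF;
}

/* Step 2: userspace acks a subset (the "ack" ioctl). Acking a bit the
 * host never offered is an error; a real ioctl would return -EINVAL,
 * here we return UINT64_MAX as a sentinel. */
static uint64_t ack_features(uint64_t offered, uint64_t wanted)
{
    if (wanted & ~offered)
        return UINT64_MAX;
    return wanted;          /* the negotiated feature set */
}
```

Ira's VIRTIO_NET_F_MRG_RXBUF question fits exactly here: once the backend sets that bit in the offered mask, old userspace simply never acks it and both sides fall back safely.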
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote: > On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote: > > What it is: vhost net is a character device that can be used to reduce > > the number of system calls involved in virtio networking. > > Existing virtio net code is used in the guest without modification. > > > > There's similarity with vringfd, with some differences and reduced scope > > - uses eventfd for signalling > > - structures can be moved around in memory at any time (good for migration) > > - support memory table and not just an offset (needed for kvm) > > > > common virtio related code has been put in a separate file vhost.c and > > can be made into a separate module if/when more backends appear. I used > > Rusty's lguest.c as the source for developing this part : this supplied > > me with witty comments I wouldn't be able to write myself. > > > > What it is not: vhost net is not a bus, and not a generic new system > > call. No assumptions are made on how guest performs hypercalls. > > Userspace hypervisors are supported as well as kvm. > > > > How it works: Basically, we connect virtio frontend (configured by > > userspace) to a backend. The backend could be a network device, or a > > tun-like device. In this version I only support raw socket as a backend, > > which can be bound to e.g. SR IOV, or to macvlan device. Backend is > > also configured by userspace, including vlan/mac etc. > > > > Status: > > This works for me, and I haven't see any crashes. > > I have done some light benchmarking (with v4), compared to userspace, I > > see improved latency (as I save up to 4 system calls per packet) but not > > bandwidth/CPU (as TSO and interrupt mitigation are not supported). For > > ping benchmark (where there's no TSO) troughput is also improved. 
> > > > Features that I plan to look at in the future: > > - tap support > > - TSO > > - interrupt mitigation > > - zero copy > > > > Hello Michael, > > I've started looking at vhost with the intention of using it over PCI to > connect physical machines together. > > The part that I am struggling with the most is figuring out which parts > of the rings are in the host's memory, and which parts are in the > guest's memory. All rings are in guest's memory, to match existing virtio code. vhost assumes that the memory space of the hypervisor userspace process covers the whole of guest memory. And there's a translation table. Ring addresses are userspace addresses, they do not undergo translation. > If I understand everything correctly, the rings are all userspace > addresses, which means that they can be moved around in physical memory, > and get pushed out to swap. Unless they are locked, yes. > AFAIK, this is impossible to handle when > connecting two physical systems, you'd need the rings available in IO > memory (PCI memory), so you can ioreadXX() them instead. To the best of > my knowledge, I shouldn't be using copy_to_user() on an __iomem address. > Also, having them migrate around in memory would be a bad thing. > > Also, I'm having trouble figuring out how the packet contents are > actually copied from one system to the other. Could you point this out > for me? The code in net/packet/af_packet.c does it when vhost calls sendmsg. > Is there somewhere I can find the userspace code (kvm, qemu, lguest, > etc.) code needed for interacting with the vhost misc device so I can > get a better idea of how userspace is supposed to work? Look in archives for k...@vger.kernel.org. the subject is qemu-kvm: vhost net. > (Features > negotiation, etc.) > > Thanks, > Ira That's not yet implemented as there are no features yet. I'm working on tap support, which will add a feature bit. Overall, qemu does an ioctl to query supported features, and then acks them with another ioctl. 
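The translation table Michael mentions can be illustrated with a minimal sketch. Struct and field names here are invented for illustration (the real vhost memory table differs); the key property is that a guest-supplied address is only ever used after it resolves inside a registered region, which is also how the "can't let the guest specify physical addresses" concern is handled:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of a vhost-style memory table: each region maps
 * a range of guest-physical addresses onto the hypervisor process's
 * own virtual address space. */
struct mem_region {
    uint64_t guest_phys;   /* start of region in guest physical space */
    uint64_t size;         /* length of region in bytes */
    uint64_t user_addr;    /* corresponding userspace virtual address */
};

/* Translate a guest-physical address through the table; returns 0 when
 * the address falls outside every region, i.e. the guest handed us a
 * descriptor it has no right to. */
static uint64_t translate(const struct mem_region *table, size_t n,
                          uint64_t gpa)
{
    for (size_t i = 0; i < n; i++) {
        if (gpa >= table[i].guest_phys &&
            gpa - table[i].guest_phys < table[i].size)
            return table[i].user_addr + (gpa - table[i].guest_phys);
    }
    return 0;
}
```

Ring addresses, as Michael notes, bypass this table entirely: they are already userspace addresses supplied by the (trusted) hypervisor process, not by the guest.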
I'm also trying to avoid duplicating functionality available elsewhere. So that to check e.g. TSO support, you'd just look at the underlying hardware device you are binding to. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
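The eventfd signalling the patch description relies on ("uses eventfd for signalling") can be demonstrated in a few lines of userspace C. Helper names below are made up for the sketch; in vhost the two fds are wired to KVM's ioeventfd (guest kick in) and irqfd (interrupt out) rather than read in a loop:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* One side (e.g. the guest's ring kick, via ioeventfd) adds to the
 * counter; the other side (the vhost worker) reads and resets it. */
static int make_notifier(void)
{
    return eventfd(0, 0);            /* 64-bit counter, starts at zero */
}

static int signal_notifier(int fd)
{
    uint64_t one = 1;
    return write(fd, &one, sizeof one) == sizeof one ? 0 : -1;
}

/* Returns the accumulated count: multiple kicks since the last read
 * coalesce into a single wakeup, which is part of the syscall saving. */
static uint64_t consume_notifier(int fd)
{
    uint64_t n = 0;
    if (read(fd, &n, sizeof n) != sizeof n)
        return 0;
    return n;
}
```

The coalescing is the interesting property: however many times the producer signals before the consumer wakes, the consumer handles them in one pass.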
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote: > What it is: vhost net is a character device that can be used to reduce > the number of system calls involved in virtio networking. > Existing virtio net code is used in the guest without modification. > > There's similarity with vringfd, with some differences and reduced scope > - uses eventfd for signalling > - structures can be moved around in memory at any time (good for migration) > - support memory table and not just an offset (needed for kvm) > > common virtio related code has been put in a separate file vhost.c and > can be made into a separate module if/when more backends appear. I used > Rusty's lguest.c as the source for developing this part : this supplied > me with witty comments I wouldn't be able to write myself. > > What it is not: vhost net is not a bus, and not a generic new system > call. No assumptions are made on how guest performs hypercalls. > Userspace hypervisors are supported as well as kvm. > > How it works: Basically, we connect virtio frontend (configured by > userspace) to a backend. The backend could be a network device, or a > tun-like device. In this version I only support raw socket as a backend, > which can be bound to e.g. SR IOV, or to macvlan device. Backend is > also configured by userspace, including vlan/mac etc. > > Status: > This works for me, and I haven't seen any crashes. > I have done some light benchmarking (with v4), compared to userspace, I > see improved latency (as I save up to 4 system calls per packet) but not > bandwidth/CPU (as TSO and interrupt mitigation are not supported). For > ping benchmark (where there's no TSO) throughput is also improved. > > Features that I plan to look at in the future: > - tap support > - TSO > - interrupt mitigation > - zero copy > Hello Michael, I've started looking at vhost with the intention of using it over PCI to connect physical machines together.
The part that I am struggling with the most is figuring out which parts of the rings are in the host's memory, and which parts are in the guest's memory. If I understand everything correctly, the rings are all userspace addresses, which means that they can be moved around in physical memory, and get pushed out to swap. AFAIK, this is impossible to handle when connecting two physical systems, you'd need the rings available in IO memory (PCI memory), so you can ioreadXX() them instead. To the best of my knowledge, I shouldn't be using copy_to_user() on an __iomem address. Also, having them migrate around in memory would be a bad thing. Also, I'm having trouble figuring out how the packet contents are actually copied from one system to the other. Could you point this out for me? Is there somewhere I can find the userspace code (kvm, qemu, lguest, etc.) code needed for interacting with the vhost misc device so I can get a better idea of how userspace is supposed to work? (Features negotiation, etc.) Thanks, Ira > Acked-by: Arnd Bergmann > Signed-off-by: Michael S. Tsirkin > > --- > MAINTAINERS| 10 + > arch/x86/kvm/Kconfig |1 + > drivers/Makefile |1 + > drivers/vhost/Kconfig | 11 + > drivers/vhost/Makefile |2 + > drivers/vhost/net.c| 475 ++ > drivers/vhost/vhost.c | 688 > > drivers/vhost/vhost.h | 122 > include/linux/Kbuild |1 + > include/linux/miscdevice.h |1 + > include/linux/vhost.h | 101 +++ > 11 files changed, 1413 insertions(+), 0 deletions(-) > create mode 100644 drivers/vhost/Kconfig > create mode 100644 drivers/vhost/Makefile > create mode 100644 drivers/vhost/net.c > create mode 100644 drivers/vhost/vhost.c > create mode 100644 drivers/vhost/vhost.h > create mode 100644 include/linux/vhost.h > > diff --git a/MAINTAINERS b/MAINTAINERS > index b1114cf..de4587f 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -5431,6 +5431,16 @@ S: Maintained > F: Documentation/filesystems/vfat.txt > F: fs/fat/ > > +VIRTIO HOST (VHOST) > +P: Michael S. 
Tsirkin > +M: m...@redhat.com > +L: k...@vger.kernel.org > +L: virtualizat...@lists.osdl.org > +L: net...@vger.kernel.org > +S: Maintained > +F: drivers/vhost/ > +F: include/linux/vhost.h > + > VIA RHINE NETWORK DRIVER > M: Roger Luethi > S: Maintained > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index b84e571..94f44d9 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -64,6 +64,7 @@ config KVM_AMD > > # OK, it's a little counter-intuitive to do this, but it puts it neatly under > # the virtualization menu. > +source drivers/vhost/Kconfig > source drivers/lguest/Kconfig > source drivers/virtio/Kconfig > > diff --git a/drivers/Makefile b/drivers/Makefile > index bc4205d..1551ae1 100644 > --- a/drivers/Makefile > +++ b/drivers/Makefile > @@ -105,6 +105,7 @@ obj-$(CONFIG_HID)
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Hi, Michael That's a great job. We are now working on supporting VMDq on KVM, and since the VMDq hardware presents L2 sorting based on MAC addresses and VLAN tags, our target is to implement a zero copy solution using VMDq. We started from the virtio-net architecture. What we want to propose is to use AIO combined with direct I/O:

1) Modify the virtio-net backend service in Qemu to submit aio requests composed from the virtqueue.
2) Modify the TUN/TAP device to support aio operations, with the user space buffer directly mapped into the host kernel.
3) Let a TUN/TAP device bind to a single rx/tx queue from the NIC.
4) Modify the net_dev and skb structures to permit an allocated skb to use a directly mapped user space payload buffer address rather than a kernel-allocated one.

As zero copy is also your goal, we are interested in what you have in mind, and would like to collaborate with you if possible. BTW, we will send our VMDq write-up very soon.

Thanks
Xiaohui

-Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Michael S. Tsirkin Sent: Wednesday, August 19, 2009 11:03 PM To: net...@vger.kernel.org; virtualization@lists.linux-foundation.org; k...@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com Subject: [PATCHv4 2/2] vhost_net: a kernel-level virtio server What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. There's similarity with vringfd, with some differences and reduced scope - uses eventfd for signalling - structures can be moved around in memory at any time (good for migration) - support memory table and not just an offset (needed for kvm) common virtio related code has been put in a separate file vhost.c and can be made into a separate module if/when more backends appear.
I used Rusty's lguest.c as the source for developing this part: it supplied me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how the guest performs hypercalls. Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect a virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tun-like device. In this version I only support a raw socket as a backend, which can be bound to e.g. SR-IOV, or to a macvlan device. The backend is also configured by userspace, including vlan/mac etc.

Status: This works for me, and I haven't seen any crashes. I have not run any benchmarks yet. Compared to userspace, I expect to see improved latency (as I save up to 4 system calls per packet) but not bandwidth/CPU (as TSO and interrupt mitigation are not supported).

Features that I plan to look at in the future:
- TSO
- interrupt mitigation
- zero copy

Acked-by: Arnd Bergmann
Signed-off-by: Michael S. Tsirkin
---
 MAINTAINERS                | 10 +
 arch/x86/kvm/Kconfig       |  1 +
 drivers/Makefile           |  1 +
 drivers/vhost/Kconfig      | 11 +
 drivers/vhost/Makefile     |  2 +
 drivers/vhost/net.c        | 429
 drivers/vhost/vhost.c      | 664
 drivers/vhost/vhost.h      | 108 +++
 include/linux/Kbuild       |  1 +
 include/linux/miscdevice.h |  1 +
 include/linux/vhost.h      | 100 +++
 11 files changed, 1328 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b1114cf..de4587f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5431,6 +5431,16 @@ S: Maintained
 F: Documentation/filesystems/vfat.txt
 F: fs/fat/

+VIRTIO HOST (VHOST)
+P: Michael S. Tsirkin
+M: m...@redhat.com
+L: k...@vger.kernel.org
+L: virtualizat...@lists.osdl.org
+L: net...@vger.kernel.org
+S: Maintained
+F: drivers/vhost/
+F: include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 M: Roger Luethi
 S: Maintained
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_AMD
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1551ae1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -105,6 +105,7 @@ obj-$(CONFIG_HID) += hid/
 obj-$(CONFIG_PPC_PS3) += ps3/
 obj-$(CONFIG
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
> I don't think we should do that with the tun/tap driver. By design, tun/tap is a way to interact
> with the networking stack as if coming from a device. The only way this connects to an external
> adapter is through a bridge or through IP routing, which means that it does not correspond to a
> specific NIC.
>
> I have worked on a driver I called 'macvtap', for lack of a better name, to add a new tap
> frontend to the 'macvlan' driver. Since macvlan lets you add slaves to a single NIC device, this
> gives you a direct connection between one or multiple tap devices and an external NIC, which
> works a lot better than when you have a bridge in between. There is also work underway to add a
> bridging capability to macvlan, so you can communicate directly between guests like you can do
> with a bridge.
>
> Michael's vhost_net can plug into the same macvlan infrastructure, so the work is complementary.

We used the TUN/TAP device to implement the prototype, and agree that it's not the only choice here. We'd like to compare the two if possible. What we care more about are the kernel modifications, such as the changes to the net_dev and skb structures.

Thanks
Xiaohui

-----Original Message-----
From: Arnd Bergmann [mailto:a...@arndb.de]
Sent: Monday, August 31, 2009 11:24 PM
To: Xin, Xiaohui
Cc: m...@redhat.com; net...@vger.kernel.org; virtualization@lists.linux-foundation.org; k...@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

On Monday 31 August 2009, Xin, Xiaohui wrote:
> Hi, Michael
> That's a great job. We are now working on supporting VMDq on KVM, and since the VMDq hardware
> presents L2 sorting based on MAC addresses and VLAN tags, our target is to implement a zero-copy
> solution using VMDq.

I'm also interested in helping there, please include me in the discussions.

> We started from the virtio-net architecture. What we propose is to use AIO combined with direct
> I/O:
> 1) Modify the virtio-net backend service in Qemu to submit aio requests composed from the
> virtqueue.

Right, that sounds useful.

> 2) Modify the TUN/TAP device to support aio operations and map the user space buffers directly
> into the host kernel.
> 3) Let a TUN/TAP device bind to a single rx/tx queue of the NIC.

I don't think we should do that with the tun/tap driver. By design, tun/tap is a way to interact with the networking stack as if coming from a device. The only way this connects to an external adapter is through a bridge or through IP routing, which means that it does not correspond to a specific NIC.

I have worked on a driver I called 'macvtap', for lack of a better name, to add a new tap frontend to the 'macvlan' driver. Since macvlan lets you add slaves to a single NIC device, this gives you a direct connection between one or multiple tap devices and an external NIC, which works a lot better than when you have a bridge in between. There is also work underway to add a bridging capability to macvlan, so you can communicate directly between guests like you can do with a bridge.

Michael's vhost_net can plug into the same macvlan infrastructure, so the work is complementary.

> 4) Modify the net_dev and skb structures to permit an allocated skb to use a directly mapped
> user space payload buffer address rather than a kernel-allocated one.

Yes.

> As zero copy is also your goal, we are interested in what's on your mind, and would like to
> collaborate with you if possible.
> BTW, we will send our VMDq write-up very soon.

Ok, cool.

	Arnd <><

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization
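Arnd's macvlan/macvtap setup can be illustrated with iproute2 commands. This is a sketch only: the NIC name `eth0` is an assumption, the commands require root, and macvtap support in iproute2 postdates some distributions of this era.

```shell
# Sketch: assumes a host NIC named eth0; requires root.

# Create a macvlan slave of eth0. In bridge mode, macvlan instances
# on the same NIC can also reach each other directly, without an
# external switch hairpin.
ip link add link eth0 name macvlan0 type macvlan mode bridge
ip link set macvlan0 up

# With the macvtap driver Arnd describes, the tap frontend is created
# the same way; it exposes a /dev/tapN character device that a
# hypervisor can read and write directly.
ip link add link eth0 name macvtap0 type macvtap
ip link set macvtap0 up
```

The point of the comparison above: a tap device behind a bridge is decoupled from any physical NIC, while a macvtap device is tied to one, which is what a per-NIC zero-copy path needs.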
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
> One way to share the effort is to make vmdq queues available as normal kernel interfaces. It
> would take quite a bit of work, but the end result is that no other components need to be
> changed, and it makes vmdq useful outside kvm. It also greatly reduces the amount of integration
> work needed throughout the stack (kvm/qemu/libvirt).

Yes. The common queue-pair interface we want to present will also apply to normal hardware, and we will try to leave the other components untouched.

Thanks
Xiaohui

-----Original Message-----
From: Avi Kivity [mailto:a...@redhat.com]
Sent: Tuesday, September 01, 2009 1:52 AM
To: Xin, Xiaohui
Cc: m...@redhat.com; net...@vger.kernel.org; virtualization@lists.linux-foundation.org; k...@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

On 08/31/2009 02:42 PM, Xin, Xiaohui wrote:
> Hi, Michael
> That's a great job. We are now working on supporting VMDq on KVM, and since the VMDq hardware
> presents L2 sorting based on MAC addresses and VLAN tags, our target is to implement a zero-copy
> solution using VMDq. We started from the virtio-net architecture. What we propose is to use AIO
> combined with direct I/O:
> 1) Modify the virtio-net backend service in Qemu to submit aio requests composed from the
> virtqueue.
> 2) Modify the TUN/TAP device to support aio operations and map the user space buffers directly
> into the host kernel.
> 3) Let a TUN/TAP device bind to a single rx/tx queue of the NIC.
> 4) Modify the net_dev and skb structures to permit an allocated skb to use a directly mapped
> user space payload buffer address rather than a kernel-allocated one.
>
> As zero copy is also your goal, we are interested in what's on your mind, and would like to
> collaborate with you if possible.

One way to share the effort is to make vmdq queues available as normal kernel interfaces. It would take quite a bit of work, but the end result is that no other components need to be changed, and it makes vmdq useful outside kvm. It also greatly reduces the amount of integration work needed throughout the stack (kvm/qemu/libvirt).

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
> It may be possible to make vmdq appear like an sr-iov capable device from userspace. sr-iov
> provides the userspace interfaces to allocate interfaces and assign mac addresses. To make it
> useful, you would have to handle tx multiplexing in the driver, but that would be much easier to
> consume for kvm.

What we have in mind is to support multiple net_dev structures, one per queue pair of a VMDq adapter, and to present multiple MAC addresses in user space, so that each MAC can be used by a guest. What exactly does the tx multiplexing in the driver mean?

Thanks
Xiaohui

-----Original Message-----
From: Anthony Liguori [mailto:anth...@codemonkey.ws]
Sent: Tuesday, September 01, 2009 5:57 AM
To: Avi Kivity
Cc: Xin, Xiaohui; m...@redhat.com; net...@vger.kernel.org; virtualization@lists.linux-foundation.org; k...@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; linux...@kvack.org; a...@linux-foundation.org; h...@zytor.com; gregory.hask...@gmail.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

Avi Kivity wrote:
> On 08/31/2009 02:42 PM, Xin, Xiaohui wrote:
>> Hi, Michael
>> That's a great job. We are now working on supporting VMDq on KVM, and since the VMDq hardware
>> presents L2 sorting based on MAC addresses and VLAN tags, our target is to implement a
>> zero-copy solution using VMDq. We started from the virtio-net architecture. What we propose is
>> to use AIO combined with direct I/O:
>> 1) Modify the virtio-net backend service in Qemu to submit aio requests composed from the
>> virtqueue.
>> 2) Modify the TUN/TAP device to support aio operations and map the user space buffers directly
>> into the host kernel.
>> 3) Let a TUN/TAP device bind to a single rx/tx queue of the NIC.
>> 4) Modify the net_dev and skb structures to permit an allocated skb to use a directly mapped
>> user space payload buffer address rather than a kernel-allocated one.
>>
>> As zero copy is also your goal, we are interested in what's on your mind, and would like to
>> collaborate with you if possible.
>
> One way to share the effort is to make vmdq queues available as normal kernel interfaces.

It may be possible to make vmdq appear like an sr-iov capable device from userspace. sr-iov provides the userspace interfaces to allocate interfaces and assign mac addresses. To make it useful, you would have to handle tx multiplexing in the driver, but that would be much easier to consume for kvm.

Regards,

Anthony Liguori