Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
On Tue, Sep 16, 2014 at 10:22:27PM -0700, Andy Lutomirski wrote:
> On non-PPC systems, virtio_pci should use the DMA API. This fixes
> virtio_pci on Xen. On PPC, using the DMA API would break things, so we
> need to preserve the old behavior.
>
> The big comment in this patch explains the considerations in more
> detail.
>
> Signed-off-by: Andy Lutomirski <l...@amacapital.net>
> ---
>  drivers/virtio/virtio_pci.c | 90 ++++++++++++++++++++++++++++++-----
>  1 file changed, 81 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> index a1f299fa4626..8ddb0a641878 100644
> --- a/drivers/virtio/virtio_pci.c
> +++ b/drivers/virtio/virtio_pci.c
> @@ -80,8 +80,10 @@ struct virtio_pci_vq_info
>  	/* the number of entries in the queue */
>  	int num;
>
> -	/* the virtual address of the ring queue */
> -	void *queue;
> +	/* the ring queue */
> +	void *queue;			/* virtual address */
> +	dma_addr_t queue_dma_addr;	/* bus address */
> +	bool use_dma_api;		/* are we using the DMA API? */
>
>  	/* the list node for the virtqueues list */
>  	struct list_head node;
> @@ -388,6 +390,50 @@ static int vp_request_intx(struct virtio_device *vdev)
>  	return err;
>  }
>
> +static bool vp_use_dma_api(void)
> +{
> +	/*
> +	 * Due to limitations of the DMA API, we only have two choices:
> +	 * use the DMA API (e.g. set up IOMMU mappings or apply Xen's
> +	 * physical-to-machine translation) or use direct physical
> +	 * addressing.  Furthermore, there's no sensible way yet for the
> +	 * PCI bus code to tell us whether we're supposed to act like a
> +	 * normal PCI device (and use the DMA API) or to do something
> +	 * else.  So we're stuck with heuristics here.
> +	 *
> +	 * In general, we would prefer to use the DMA API, since we
> +	 * might be driving a physical device, and such devices *must*
> +	 * use the DMA API if there is an IOMMU involved.
> +	 *
> +	 * On x86, there are no physically-mapped emulated virtio PCI
> +	 * devices that live behind an IOMMU.  On ARM, there don't seem
> +	 * to be any hypervisors that use virtio_pci (as opposed to
> +	 * virtio_mmio) that also emulate an IOMMU.  So using the DMI

Hi, I noticed a typo here. It should say DMA, not DMI. Just thought I'd
point it out.

Ira

> +	 * API is safe.
> +	 *
> +	 * On PowerPC, it's the other way around.  There usually is an
> +	 * IOMMU between us and the virtio PCI device, but the device is
> +	 * probably emulated and ignores the IOMMU.  Unfortunately, we
> +	 * can't tell whether we're talking to an emulated device or to
> +	 * a physical device that really lives behind the IOMMU.  That
> +	 * means that we're stuck with ignoring the DMA API.
> +	 */
> +
> +#ifdef CONFIG_PPC
> +	return false;
> +#else
> +	/*
> +	 * Minor optimization: if the platform promises to have physical
> +	 * PCI DMA, we turn off DMA mapping in virtio_ring.  If the
> +	 * platform's DMA API implementation is well optimized, this
> +	 * should have almost no effect, but we already have a branch in
> +	 * the vring code, and we can avoid any further indirection with
> +	 * very little effort.
> +	 */
> +	return !PCI_DMA_BUS_IS_PHYS;
> +#endif
> +}
> +
>  static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
>  				  void (*callback)(struct virtqueue *vq),
>  				  const char *name,
> @@ -416,21 +462,30 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
>  	info->num = num;
>  	info->msix_vector = msix_vec;
> +	info->use_dma_api = vp_use_dma_api();
>
> -	size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
> -	info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
> +	size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
> +	if (info->use_dma_api) {
> +		info->queue = dma_zalloc_coherent(vdev->dev.parent, size,
> +						  &info->queue_dma_addr,
> +						  GFP_KERNEL);
> +	} else {
> +		info->queue = alloc_pages_exact(PAGE_ALIGN(size),
> +						GFP_KERNEL|__GFP_ZERO);
> +		info->queue_dma_addr = virt_to_phys(info->queue);
> +	}
>  	if (info->queue == NULL) {
>  		err = -ENOMEM;
>  		goto out_info;
>  	}
>
>  	/* activate the queue */
> -	iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> +	iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
>  		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
>
>  	/* create the vring */
>  	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
> -				 true, false, info->queue,
> +				 true, info->use_dma_api, info->queue,
Re: [RFC]vhost/vhost-net backend for PCI cards
On Wed, Feb 27, 2013 at 03:50:54AM -0800, Nikhil Rao wrote:
> We are implementing a driver for a PCIe card that runs Linux. This card
> needs virtual network/disk/console devices, so we have reused the
> virtio devices on the card and provided a host backend that interacts
> with the virtio devices through the card's driver. This approach is
> very much like what was proposed in this thread:
> http://permalink.gmane.org/gmane.linux.ports.sh.devel/10379
>
> We will be posting the driver soon, so perhaps I am jumping the gun
> with my question below about replacing our backend with vhost. It is
> possible for vhost (along with vhost-net in the case of virtio-net) to
> serve as the backend. The copy between virtio buffers and skbs happens
> in the tun/tap driver, which means tun/tap may need to use a HW DMA
> engine (the card has one) for the copy across the bus to get close to
> the full PCIe bandwidth. tun/tap was probably never designed for this
> use case, but reusing vhost does simplify our backend, since it is now
> only involved in setup, and it potentially has a performance/memory
> footprint advantage by avoiding context switches and an intermediate
> buffer copy. This idea can be generalized to other cards as well.
>
> Comments/suggestions?
>
> Thanks,
> Nikhil
>
> ___
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Hi Nikhil,

I don't have any code to offer, but may be able to provide some
suggestions.

I work on a system which has a single (x86) host computer and many
PowerPC data processing boards, which are connected via PCI. This sounds
similar to your hardware. Our system was developed before vhost existed.
I built a (fairly dumb) network driver that just transfers packets over
PCI using the PowerPC DMA controller. It works; however, I think a more
generic virtio solution will work better. A virtio solution will also
allow other types of devices besides a network interface.

I have done some studying of rproc/rpmsg/vhost/vringh, and may have some
suggestions about those pieces of kernel functionality. A HW DMA engine
is absolutely needed to get good performance over the PCI bus. I don't
have experience with PCIe.

You may want to investigate rproc/rpmsg to help do virtio device
discovery.

When dealing with virtio, it may be helpful to think of your PCIe card
as the host. In virtio nomenclature, the host is in charge of copying
data. Your HW DMA engine needs to be controlled by the host. Your main
computer (the computer the PCIe card plugs into) will be a virtio guest,
and will run the virtio-net/virtio-console/etc. drivers.

Several vendors have contacted me privately to ask for the code for my
(dumb) network-over-PCI driver. A generic solution to this problem will
definitely find a userbase. I look forward to porting the code to run on
my PowerPC PCI boards when it becomes available. I am able to help
review code as well.

Good luck!
Ira
Re: [RFC]vhost/vhost-net backend for PCI cards
On Wed, Feb 27, 2013 at 04:58:20AM -0800, Nikhil Rao wrote:
> On Wed, 2013-02-27 at 11:17 -0800, Ira W. Snyder wrote:
> > When dealing with virtio, it may be helpful to think of your PCIe
> > card as the host.
>
> We wanted to support a host-based disk; using virtio-blk on the card
> seemed to be a good way to do this, given that the card runs Linux.
> Also, from a performance perspective, which would be better:
> [virtio-net on the card / backend on the host CPU] vs. [virtio-net on
> the host CPU / backend on the card], given that the host CPU is more
> powerful than the card CPU?

I never considered using virtio-blk, so I don't have any input about it.
I don't know much about virtio performance either. The experts on this
list will have to send their input.

> > In virtio nomenclature, the host is in charge of copying data. Your
> > HW DMA engine needs to be controlled by the host.
>
> In our case, the host driver controls the HW DMA engine (a subset of
> the card DMA engine channels are under host control). But you may be
> referring to the case where the host doesn't have access to the card's
> DMA engine.

That's right.

On our PowerPC cards, the DMA hardware can be controlled by either the
PowerPC processor or the PCI host system, but not both. We need the DMA
for various operations on the PowerPC card itself, so the PCI host
system cannot be used to control the DMA hardware. This is also true for
other vendors who contacted me privately. Your PCIe card seems to have
better features than any similar systems I've worked with.

I skimmed the code being used with rpmsg/rproc/virtio on the ARM DSPs on
the OMAP platform. They use the DSP as the virtio host (it copies
memory, performing the same function as vhost) and the OMAP as the
virtio guest (it runs virtio-net/etc.). Among all virtio guest drivers
(virtio-net/virtio-console/etc.), I think virtio-blk behaves differently
from the rest. All of the others work better with DMA hardware when the
PCI card is the virtio host.

You might want to contact Ohad Ben-Cohen for his advice. He did a lot of
work on drivers/rpmsg and drivers/remoteproc. He works with the OMAP
hardware.

Ira
Re: [PATCH 00/02] virtio: Virtio platform driver
On Wed, Mar 16, 2011 at 02:17:15PM +0900, Magnus Damm wrote:
> Hi Rusty,
>
> On Wed, Mar 16, 2011 at 12:46 PM, Rusty Russell <ru...@rustcorp.com.au> wrote:
> > On Thu, 10 Mar 2011 16:05:41 +0900, Magnus Damm <magnus.d...@gmail.com> wrote:
> > > virtio: Virtio platform driver
> > >
> > > [PATCH 01/02] virtio: Break out lguest virtio code to virtio_lguest.c
> > > [PATCH 02/02] virtio: Add virtio platform driver
> >
> > I have no problem with these patches, but it's just churn until we
> > see your actual drivers.
>
> Well, actually this platform driver is used together with already
> existing drivers, so there are no new virtio drivers to wait for. The
> drivers that have been tested so far are:
>
> CONFIG_VIRTIO_CONSOLE=y
> CONFIG_VIRTIO_NET=y
>
> At this point there are four different pieces of code working together:
>
> 1) Virtio platform driver patches (for guest)
> 2) SH4AL-DSP guest kernel patch
> 3) ARM UIO driver patches (for host)
> 4) User space backing code for ARM based on lguest.c
>
> The patches in this mail thread are 1), and I decided to brush up that
> portion and submit it upstream because it's the part that is easiest
> to break out. I intend to post the rest bit by bit over time, but if
> someone is interested then I can post everything at once too.

I'm very interested in the full series of patches. I want to do
something similar to talk between two Linux kernels (x86 and PowerPC)
connected by a PCI bus.

Thanks,
Ira

> The S/390 devs might be interested, as their bus is very similar too...
> The lguest device code is very similar as well; perhaps it's worth
> refactoring that a bit to build on top of the platform driver. Not sure
> if you see that as a move in the right direction though.
>
> Thanks for your feedback!
>
> / magnus
Re: virtio over PCI
On Wed, Mar 03, 2010 at 05:09:48PM +1100, Michael Ellerman wrote:
> Hi guys,
>
> I was looking around at virtio-over-PCI stuff and noticed you had
> started some work on a driver. The last I can find via Google is v2
> from the middle of last year; is that as far as it got?
>
> http://lkml.org/lkml/2009/2/23/353

Yep, that is pretty much as far as I got. It was more-or-less rejected
because I hooked two instances of virtio-net together, rather than
having a proper backend and using virtio-net as the frontend. I got
started on writing a backend, which was never posted to LKML because I
never finished it. Feel free to take the code and use it to start your
own project. Note that vhost-net exists now, and is an in-kernel backend
for virtio-net. It *may* be possible to use this, rather than writing a
userspace backend as I started to do.

http://www.mmarray.org/~iws/virtio-phys/

I also got started with the alacrityvm project, developing a driver for
their virtualization framework. That project is nowhere near finished.
The virtualization folks basically told GHaskins (the alacrityvm author)
that alacrityvm wouldn't ever make it into mainline Linux.

http://www.mmarray.org/~iws/vbus/

Unfortunately, I've been pulled onto other projects for the time being.
However, I'd really like to be able to use a virtio-over-PCI style
driver, rather than relying on my own custom (slow, unoptimized) network
driver (PCINet). If you get something mostly working (and mostly agreed
upon by the virtualization guys), I will make the time to test it and
get it cleaned up. I've had 10+ people email me privately about this
kind of driver now. It is an area where Linux is sorely lacking.

I'm happy to provide any help I can, including testing on an
MPC8349EA-based system. I would suggest talking to the virtualization
mailing list before you get too deep into the project. They sometimes
have good advice. I've added them to the CC list, so maybe they can
comment.

https://lists.linux-foundation.org/mailman/listinfo/virtualization

Good luck, and let me know if I can help.
Ira
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to reduce
> the number of system calls involved in virtio networking. Existing
> virtio net code is used in the guest without modification.
>
> There's similarity with vringfd, with some differences and reduced
> scope:
> - uses eventfd for signalling
> - structures can be moved around in memory at any time (good for
>   migration)
> - support memory table and not just an offset (needed for kvm)
>
> Common virtio-related code has been put in a separate file, vhost.c,
> and can be made into a separate module if/when more backends appear. I
> used Rusty's lguest.c as the source for developing this part: it
> supplied me with witty comments I wouldn't be able to write myself.
>
> What it is not: vhost net is not a bus, and not a generic new system
> call. No assumptions are made on how the guest performs hypercalls.
> Userspace hypervisors are supported as well as kvm.
>
> How it works: Basically, we connect the virtio frontend (configured by
> userspace) to a backend. The backend could be a network device, or a
> tun-like device. In this version I only support a raw socket as a
> backend, which can be bound to e.g. SR-IOV, or to a macvlan device. The
> backend is also configured by userspace, including vlan/mac etc.
>
> Status: This works for me, and I haven't seen any crashes. I have done
> some light benchmarking (with v4). Compared to userspace, I see
> improved latency (as I save up to 4 system calls per packet) but not
> bandwidth/CPU (as TSO and interrupt mitigation are not supported). For
> the ping benchmark (where there's no TSO), throughput is also improved.
>
> Features that I plan to look at in the future:
> - tap support
> - TSO
> - interrupt mitigation
> - zero copy
>
> Acked-by: Arnd Bergmann <a...@arndb.de>
> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> ---
>  MAINTAINERS                |   10 +
>  arch/x86/kvm/Kconfig       |    1 +
>  drivers/Makefile           |    1 +
>  drivers/vhost/Kconfig      |   11 +
>  drivers/vhost/Makefile     |    2 +
>  drivers/vhost/net.c        |  475 ++++++++++++++
>  drivers/vhost/vhost.c      |  688 ++++++++++++++++++++
>  drivers/vhost/vhost.h      |  122 ++++
>  include/linux/Kbuild       |    1 +
>  include/linux/miscdevice.h |    1 +
>  include/linux/vhost.h      |  101 +++
>  11 files changed, 1413 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vhost/Kconfig
>  create mode 100644 drivers/vhost/Makefile
>  create mode 100644 drivers/vhost/net.c
>  create mode 100644 drivers/vhost/vhost.c
>  create mode 100644 drivers/vhost/vhost.h
>  create mode 100644 include/linux/vhost.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b1114cf..de4587f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5431,6 +5431,16 @@ S:	Maintained
>  F:	Documentation/filesystems/vfat.txt
>  F:	fs/fat/
>
> +VIRTIO HOST (VHOST)
> +P:	Michael S. Tsirkin
> +M:	m...@redhat.com
> +L:	k...@vger.kernel.org
> +L:	virtualizat...@lists.osdl.org
> +L:	net...@vger.kernel.org
> +S:	Maintained
> +F:	drivers/vhost/
> +F:	include/linux/vhost.h
> +
>  VIA RHINE NETWORK DRIVER
>  M:	Roger Luethi <r...@hellgate.ch>
>  S:	Maintained
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index b84e571..94f44d9 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -64,6 +64,7 @@ config KVM_AMD
>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>  # the virtualization menu.
> +source "drivers/vhost/Kconfig"
>  source "drivers/lguest/Kconfig"
>  source "drivers/virtio/Kconfig"
>
> diff --git a/drivers/Makefile b/drivers/Makefile
> index bc4205d..1551ae1 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -105,6 +105,7 @@ obj-$(CONFIG_HID)		+= hid/
>  obj-$(CONFIG_PPC_PS3)		+= ps3/
>  obj-$(CONFIG_OF)		+= of/
>  obj-$(CONFIG_SSB)		+= ssb/
> +obj-$(CONFIG_VHOST_NET)	+= vhost/
>  obj-$(CONFIG_VIRTIO)		+= virtio/
>  obj-$(CONFIG_VLYNQ)		+= vlynq/
>  obj-$(CONFIG_STAGING)		+= staging/
>
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> new file mode 100644
> index 000..d955406
> --- /dev/null
> +++ b/drivers/vhost/Kconfig
> @@ -0,0 +1,11 @@
> +config VHOST_NET
> +	tristate "Host kernel accelerator for virtio net"
> +	depends on NET && EVENTFD
> +	---help---
> +	  This kernel module can be loaded in host kernel to accelerate
> +	  guest networking with virtio_net. Not to be confused with virtio_net
> +	  module itself which needs to be loaded in guest kernel.
> +
> +	  To compile this driver as a module, choose M here: the module will
> +	  be called vhost_net.
> +
> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> new file mode 100644
> index 000..72dd020
> --- /dev/null
> +++ b/drivers/vhost/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_VHOST_NET) += vhost_net.o
> +vhost_net-y := vhost.o net.o
> diff --git
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Thu, Sep 24, 2009 at 10:18:28AM +0300, Avi Kivity wrote:
On 09/24/2009 12:15 AM, Gregory Haskins wrote:

There are various aspects about designing high-performance virtual
devices, such as providing the shortest paths possible between the
physical resources and the consumers. Conversely, we also need to ensure
that we meet proper isolation/protection guarantees at the same time.
What this means is there are various aspects to any high-performance PV
design that need to be placed in-kernel to maximize performance yet
properly isolate the guest. For instance, you are required to have your
signal-path (interrupts and hypercalls), your memory-path (gpa
translation), and your addressing/isolation model in-kernel to maximize
performance.

Exactly. That's what vhost puts into the kernel and nothing more.

Actually, no. Generally, _KVM_ puts those things into the kernel, and
vhost consumes them. Without KVM (or something equivalent), vhost is
incomplete. One of my goals with vbus is to generalize the "something
equivalent" part here.

I don't really see how vhost and vbus are different here. vhost expects
signalling to happen through a couple of eventfds and requires someone
to supply them and implement kernel support (if needed). vbus requires
someone to write a connector to provide the signalling implementation.
Neither will work out-of-the-box when implementing virtio-net over
falling dominos, for example.

Vbus accomplishes its in-kernel isolation model by providing a
"container" concept, where objects are placed into this container by
userspace. The host kernel enforces isolation/protection by using a
namespace to identify objects that is only relevant within a specific
container's context (namely, a u32 dev-id). The guest addresses the
objects by dev-id, and the kernel ensures that the guest can't access
objects outside of its dev-id namespace.

vhost manages to accomplish this without any kernel support.

No, vhost manages to accomplish this because of KVM's kernel support
(ioeventfd, etc). Without KVM-like in-kernel support, vhost is merely a
kind of tuntap-like clone signalled by eventfds.

Without a vbus-connector-falling-dominos, vbus-venet can't do anything
either. Both vhost and vbus need an interface; vhost's is just narrower
since it doesn't do configuration or enumeration.

This goes directly to my rebuttal of your claim that vbus places too
much in the kernel. I state that, one way or the other, address decode
and isolation _must_ be in the kernel for performance. Vbus does this
with a devid/container scheme. vhost+virtio-pci+kvm does it with
pci+pio+ioeventfd.

vbus doesn't do kvm guest address decoding for the fast path. It's
still done by ioeventfd. The guest simply has no access to any vhost
resources other than the guest->host doorbell, which is handed to the
guest outside vhost (so it's somebody else's problem, in userspace).

You mean _controlled_ by userspace, right? Obviously, the other side of
the kernel still needs to be programmed (ioeventfd, etc). Otherwise,
vhost would be pointless: e.g. just use vanilla tuntap if you don't
need fast in-kernel decoding.

Yes (though for something like level-triggered interrupts we're
probably keeping it in userspace, enjoying the benefits of the vhost
data path while paying more for signalling).

All that is required is a way to transport a message with a devid
attribute as an address (such as DEVCALL(devid)), and the framework
provides the rest of the decode+execute function.

vhost avoids that.

No, it doesn't avoid it. It just doesn't specify how it's done, and
relies on something else to do it on its behalf. That someone else can
be in userspace, apart from the actual fast path.

Conversely, vbus specifies how it's done, but not how to transport the
verb across the wire. That is the role of the vbus-connector
abstraction.

So again, vbus does everything in the kernel (since it's so easy and
cheap) but expects a vbus-connector. vhost does configuration in
userspace (since it's so clunky and fragile) but expects a couple of
eventfds.

Contrast this to vhost+virtio-pci (called simply "vhost" from here).

It's the wrong name. vhost implements only the data path.

Understood, but vhost+virtio-pci is what I am contrasting, and I use
"vhost" for short from that point on because I am too lazy to type the
whole name over and over ;)

If you #define A A+B+C, don't expect intelligent conversation
afterwards.

It is not immune to requiring in-kernel addressing support either, but
rather it just does it differently (and it's not, as you might expect,
via qemu). Vhost relies on QEMU to render PCI objects to the guest,
which the guest assigns resources (such
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tue, Sep 22, 2009 at 12:43:36PM +0300, Avi Kivity wrote:
On 09/22/2009 12:43 AM, Ira W. Snyder wrote:

Sure, virtio-ira, and he is on his own to make a bus-model under that,
or virtio-vbus + vbus-ira-connector to use the vbus framework. Either
model can work, I agree.

Yes, I'm having to create my own bus model, a la lguest, virtio-pci,
and virtio-s390. It isn't especially easy. I can steal lots of code
from the lguest bus model, but sometimes it is good to generalize,
especially after the fourth implementation or so. I think this is what
GHaskins tried to do.

Yes. vbus is more finely layered, so there is less code duplication.
The virtio layering was more or less dictated by Xen, which doesn't
have shared memory (it uses grant references instead). As a matter of
fact lguest, kvm/pci, and kvm/s390 all have shared memory, as you do,
so that part is duplicated. It's probably possible to add a
virtio-shmem.ko library that people who do have shared memory can
reuse.

Seems like a nice benefit of vbus.

I've given it some thought, and I think that running vhost-net (or
similar) on the ppc boards, with virtio-net on the x86 crate server,
will work. The virtio-ring abstraction is almost good enough to work
for this situation, but I had to re-invent it to work with my boards.

I've exposed a 16K region of memory as PCI BAR1 from my ppc board.
Remember that this is the host system. I used each 4K block as a device
descriptor which contains:

1) the type of device, config space, etc. for virtio
2) the desc table (virtio memory descriptors, see virtio-ring)
3) the avail table (available entries in the desc table)

Won't access from x86 to this memory be slow? (On the other hand, if
you change it to main memory, access from ppc will be slow... it really
depends on how your system is tuned.)

Writes across the bus are fast, reads across the bus are slow. These
are just the descriptor tables for memory buffers, not the physical
memory buffers themselves.

These only need to be written by the guest (x86), and read by the host
(ppc). The host never changes the tables, so we can cache a copy in the
guest, for a fast detach_buf() implementation (see virtio-ring, which
I'm copying the design from). The only accesses are writes across the
PCI bus. There is never a need to do a read (except for slow-path
configuration).

Parts 2 and 3 are repeated three times, to allow for a maximum of three
virtqueues per device. This is good enough for all current drivers.

The plan is to switch to multiqueue soon. Will not affect you if your
boards are uniprocessor or small smp.

Everything I have is UP. I don't need extreme performance, either.
40MB/sec is the minimum I need to reach, though I'd like to have some
headroom. For reference, using the CPU to handle data transfers, I get
~2MB/sec transfers. Using the DMA engine, I've hit about 60MB/sec with
my crossed-wires virtio-net.

I've gotten plenty of email about this from lots of interested
developers. There are people who would like this kind of system to just
work, while having to write just some glue for their device, just like
a network driver.

I hunch most people have created some proprietary mess that basically
works, and left it at that. So long as you keep the system-dependent
features hookable or configurable, it should work.

So, here is a desperate cry for help. I'd like to make this work, and
I'd really like to see it in mainline. I'm trying to give back to the
community from which I've taken plenty.

Not sure who you're crying for help to. Once you get this working, post
patches. If the patches are reasonably clean and don't impact
performance for the main use case, and if you can show the need, I
expect they'll be merged.

In the spirit of "post early and often", I'm making my code available,
that's all. I'm asking anyone interested for some review, before I have
to re-code this for about the fifth time now. I'm trying to avoid
Haskins' situation, where he's invented and debugged a lot of new code,
and then been told to do it completely differently.

Yes, the code I posted is only compile-tested, because quite a lot of
code (kernel and userspace) must be working before anything works at
all. I hate to design the whole thing, then be told that something
fundamental about it is wrong, and have to completely re-write it.

Thanks for the comments,
Ira

--
error compiling committee.c: too many arguments to function
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Wed, Sep 16, 2009 at 11:11:57PM -0400, Gregory Haskins wrote: Avi Kivity wrote: On 09/16/2009 10:22 PM, Gregory Haskins wrote: Avi Kivity wrote: On 09/16/2009 05:10 PM, Gregory Haskins wrote: If kvm can do it, others can. The problem is that you seem to either hand-wave over details like this, or you give details that are pretty much exactly what vbus does already. My point is that I've already sat down and thought about these issues and solved them in a freely available GPL'ed software package. In the kernel. IMO that's the wrong place for it. 3) in-kernel: You can do something like virtio-net to vhost to potentially meet some of the requirements, but not all. In order to fully meet (3), you would need to do some of that stuff you mentioned in the last reply with muxing device-nr/reg-nr. In addition, we need to have a facility for mapping eventfds and establishing a signaling mechanism (like PIO+qid), etc. KVM does this with IRQFD/IOEVENTFD, but we dont have KVM in this case so it needs to be invented. irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted. Not per se, but it needs to be interfaced. How do I register that eventfd with the fastpath in Ira's rig? How do I signal the eventfd (x86-ppc, and ppc-x86)? Sorry to reply so late to this thread, I've been on vacation for the past week. If you'd like to continue in another thread, please start it and CC me. On the PPC, I've got a hardware doorbell register which generates 30 distiguishable interrupts over the PCI bus. I have outbound and inbound registers, which can be used to signal the other side. I assume it isn't too much code to signal an eventfd in an interrupt handler. I haven't gotten to this point in the code yet. To take it to the next level, how do I organize that mechanism so that it works for more than one IO-stream (e.g. address the various queues within ethernet or a different device like the console)? KVM has IOEVENTFD and IRQFD managed with MSI and PIO. 
This new rig does not have the luxury of an established IO paradigm. Is vbus the only way to implement a solution? No. But it is _a_ way, and its one that was specifically designed to solve this very problem (as well as others). (As an aside, note that you generally will want an abstraction on top of irqfd/eventfd like shm-signal or virtqueues to do shared-memory based event mitigation, but I digress. That is a separate topic). To meet performance, this stuff has to be in kernel and there has to be a way to manage it. and management belongs in userspace. vbus does not dictate where the management must be. Its an extensible framework, governed by what you plug into it (ala connectors and devices). For instance, the vbus-kvm connector in alacrityvm chooses to put DEVADD and DEVDROP hotswap events into the interrupt stream, because they are simple and we already needed the interrupt stream anyway for fast-path. As another example: venet chose to put -call(MACQUERY) config-space into its call namespace because its simple, and we already need -calls() for fastpath. It therefore exports an attribute to sysfs that allows the management app to set it. I could likewise have designed the connector or device-model differently as to keep the mac-address and hotswap-events somewhere else (QEMU/PCI userspace) but this seems silly to me when they are so trivial, so I didn't. Since vbus was designed to do exactly that, this is what I would advocate. You could also reinvent these concepts and put your own mux and mapping code in place, in addition to all the other stuff that vbus does. But I am not clear why anyone would want to. Maybe they like their backward compatibility and Windows support. This is really not relevant to this thread, since we are talking about Ira's hardware. But if you must bring this up, then I will reiterate that you just design the connector to interface with QEMU+PCI and you have that too if that was important to you. 
But on that topic: Since you could consider KVM a motherboard manufacturer of sorts (it just happens to be virtual hardware), I don't know why KVM seems to consider itself the only motherboard manufacturer in the world that has to make everything look legacy. If a company like ASUS wants to add some cutting edge IO controller/bus, they simply do it. Pretty much every product release may contain a different array of devices, many of which are not backwards compatible with any prior silicon. The guy/gal installing Windows on that system may see a "?" in device-manager until they load a driver that supports the new chip, and subsequently it works. It is certainly not a requirement to make said chip somehow work with existing drivers/facilities on bare metal, per se. Why should virtual systems be different?

So, yeah, the
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:

What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope:
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- support memory table and not just an offset (needed for kvm)

Common virtio related code has been put in a separate file vhost.c and can be made into a separate module if/when more backends appear. I used Rusty's lguest.c as the source for developing this part: this supplied me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how guest performs hypercalls. Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tun-like device. In this version I only support raw socket as a backend, which can be bound to e.g. SR-IOV, or to macvlan device. Backend is also configured by userspace, including vlan/mac etc.

Status: This works for me, and I haven't seen any crashes. I have done some light benchmarking (with v4); compared to userspace, I see improved latency (as I save up to 4 system calls per packet) but not bandwidth/CPU (as TSO and interrupt mitigation are not supported). For the ping benchmark (where there's no TSO) throughput is also improved.

Features that I plan to look at in the future:
- tap support
- TSO
- interrupt mitigation
- zero copy

Hello Michael,

I've started looking at vhost with the intention of using it over PCI to connect physical machines together.
The part that I am struggling with the most is figuring out which parts of the rings are in the host's memory, and which parts are in the guest's memory. If I understand everything correctly, the rings are all userspace addresses, which means that they can be moved around in physical memory, and get pushed out to swap. AFAIK, this is impossible to handle when connecting two physical systems: you'd need the rings available in IO memory (PCI memory), so you can ioreadXX() them instead. To the best of my knowledge, I shouldn't be using copy_to_user() on an __iomem address. Also, having them migrate around in memory would be a bad thing.

Also, I'm having trouble figuring out how the packet contents are actually copied from one system to the other. Could you point this out for me?

Is there somewhere I can find the userspace code (kvm, qemu, lguest, etc.) needed for interacting with the vhost misc device, so I can get a better idea of how userspace is supposed to work? (Feature negotiation, etc.)

Thanks,
Ira

Acked-by: Arnd Bergmann a...@arndb.de
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 MAINTAINERS                |   10 +
 arch/x86/kvm/Kconfig       |    1 +
 drivers/Makefile           |    1 +
 drivers/vhost/Kconfig      |   11 +
 drivers/vhost/Makefile     |    2 +
 drivers/vhost/net.c        |  475 ++
 drivers/vhost/vhost.c      |  688
 drivers/vhost/vhost.h      |  122
 include/linux/Kbuild       |    1 +
 include/linux/miscdevice.h |    1 +
 include/linux/vhost.h      |  101 +++
 11 files changed, 1413 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b1114cf..de4587f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5431,6 +5431,16 @@
 S: Maintained
 F: Documentation/filesystems/vfat.txt
 F: fs/fat/
+VIRTIO HOST (VHOST)
+P: Michael S. Tsirkin
+M: m...@redhat.com
+L: k...@vger.kernel.org
+L: virtualizat...@lists.osdl.org
+L: net...@vger.kernel.org
+S: Maintained
+F: drivers/vhost/
+F: include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 M: Roger Luethi r...@hellgate.ch
 S: Maintained
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_AMD
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1551ae1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -105,6 +105,7 @@
 obj-$(CONFIG_HID) += hid/
 obj-$(CONFIG_PPC_PS3)
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wed, Aug 12, 2009 at 08:31:04PM +0300, Michael S. Tsirkin wrote: On Wed, Aug 12, 2009 at 10:19:22AM -0700, Ira W. Snyder wrote:

[ snip out code ]

We discussed this before, and I still think this could be directly derived from struct virtqueue, in the same way that vring_virtqueue is derived from struct virtqueue. That would make it possible for simple device drivers to use the same driver in both host and guest, similar to how Ira Snyder used virtqueues to make virtio_net run between two hosts running the same code [1]. Ideally, I guess you should be able to even make virtio_net work in the host if you do that, but that could bring other complexities. I have no comments about the vhost code itself, I haven't reviewed it.

It might be interesting to try using a virtio-net in the host kernel to communicate with the virtio-net running in the guest kernel. The lack of a management interface is the biggest problem you will face (setting MAC addresses, negotiating features, etc. doesn't work intuitively).

That was one of the reasons I decided to move most of the code out to userspace. My kernel driver only handles the datapath; it's much smaller than virtio net.

Getting the network interfaces talking is relatively easy.

Ira

Tried this, but:
- guest memory isn't pinned, so copy_to_user is needed to access it, and errors need to be handled in a sane way
- used/available roles are reversed
- kick/interrupt roles are reversed

So most of the code then looks like: if (host) { } else { } return. The only common part is walking the descriptor list, but that's like 10 lines of code. At which point it's better to keep host/guest code separate, IMO.

Ok, that makes sense. Let me see if I understand the concept of the driver. Here's a picture of what makes sense to me:

	          guest system
	---------------------------------
	| userspace applications        |
	---------------------------------
	| kernel network stack          |
	---------------------------------
	| virtio-net                    |
	---------------------------------
	| transport (virtio-ring, etc.) |
	---------------------------------
	|                               |
	---------------------------------
	| transport (virtio-ring, etc.) |
	---------------------------------
	| some driver (maybe vhost?)    | <-- [1]
	---------------------------------
	| kernel network stack          |
	---------------------------------
	           host system

From the host's network stack, packets can be forwarded out to the physical network, or be consumed by a normal userspace application on the host. Just as if this were any other network interface.

In my patch, [1] was the virtio-net driver, completely unmodified.

So, does this patch accomplish the above diagram? If so, why the copy_to_user(), etc.? Maybe I'm confusing this with my system, where the guest is another physical system, separated by the PCI bus.

Ira
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization
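[Editor's note: a hedged sketch of the "walking the descriptor list" common part mentioned above, roughly the "10 lines of code" in question. The struct layout mirrors the vring descriptor format, but this is a simplified flat-memory illustration; the real definitions live in linux/virtio_ring.h, and in vhost the reads would go through copy_from_user (or, in Ira's PCI case, ioreadXX).]

```c
#include <stdint.h>

#define VRING_DESC_F_NEXT 1 /* chain continues via the next field */

struct vring_desc {
    uint64_t addr;  /* buffer address  */
    uint32_t len;   /* buffer length   */
    uint16_t flags; /* NEXT, WRITE ... */
    uint16_t next;  /* index of next descriptor in the chain */
};

/* Walk the chain starting at head and sum the buffer lengths. */
static uint32_t chain_len(const struct vring_desc *desc, uint16_t head)
{
    uint32_t total = 0;
    uint16_t i = head;

    for (;;) {
        total += desc[i].len;
        if (!(desc[i].flags & VRING_DESC_F_NEXT))
            break;
        i = desc[i].next;
    }
    return total;
}
```

The host and guest sides disagree about who produces and who consumes, but this chain walk is the same on both, which is why it is the only piece Michael identifies as shareable.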
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wed, Aug 12, 2009 at 07:03:22PM +0200, Arnd Bergmann wrote: On Monday 10 August 2009, Michael S. Tsirkin wrote:

+struct workqueue_struct *vhost_workqueue;

[nitpicking] This could be static.

+/* The virtqueue structure describes a queue attached to a device. */
+struct vhost_virtqueue {
+	struct vhost_dev *dev;
+
+	/* The actual ring of buffers. */
+	struct mutex mutex;
+	unsigned int num;
+	struct vring_desc __user *desc;
+	struct vring_avail __user *avail;
+	struct vring_used __user *used;
+	struct file *kick;
+	struct file *call;
+	struct file *error;
+	struct eventfd_ctx *call_ctx;
+	struct eventfd_ctx *error_ctx;
+
+	struct vhost_poll poll;
+
+	/* The routine to call when the Guest pings us, or timeout. */
+	work_func_t handle_kick;
+
+	/* Last available index we saw. */
+	u16 last_avail_idx;
+
+	/* Last index we used. */
+	u16 last_used_idx;
+
+	/* Outstanding buffers */
+	unsigned int inflight;
+
+	/* Is this blocked? */
+	bool blocked;
+
+	struct iovec iov[VHOST_NET_MAX_SG];
+
+} cacheline_aligned;

We discussed this before, and I still think this could be directly derived from struct virtqueue, in the same way that vring_virtqueue is derived from struct virtqueue. That would make it possible for simple device drivers to use the same driver in both host and guest, similar to how Ira Snyder used virtqueues to make virtio_net run between two hosts running the same code [1]. Ideally, I guess you should be able to even make virtio_net work in the host if you do that, but that could bring other complexities. I have no comments about the vhost code itself, I haven't reviewed it.

It might be interesting to try using a virtio-net in the host kernel to communicate with the virtio-net running in the guest kernel. The lack of a management interface is the biggest problem you will face (setting MAC addresses, negotiating features, etc. doesn't work intuitively). Getting the network interfaces talking is relatively easy.
Ira
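[Editor's note: a hedged userspace sketch of the derivation pattern Arnd refers to, where vring_virtqueue is "derived from" struct virtqueue by embedding it and recovering the container with container_of. The structs below are simplified stand-ins, not the real virtio definitions.]

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of helper. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic part, handed to device drivers (simplified). */
struct virtqueue {
    void (*kick)(struct virtqueue *vq);
};

/* Implementation type: embeds the generic struct plus private state. */
struct vring_virtqueue {
    struct virtqueue vq;  /* must stay embedded, not pointed-to */
    unsigned int num;     /* ring size, private to the backend  */
};

/* Backend code recovers its derived type from the base pointer. */
static unsigned int vring_num(struct virtqueue *vq)
{
    struct vring_virtqueue *vvq =
        container_of(vq, struct vring_virtqueue, vq);
    return vvq->num;
}
```

A vhost_virtqueue derived the same way would let common driver code operate on struct virtqueue while each backend keeps its own private fields, which is the reuse Arnd is arguing for.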