RE: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
-----Original Message-----
From: Scott Feldman [mailto:scofe...@cisco.com]
Sent: Wednesday, May 05, 2010 7:04 PM
To: Shreyas Bhatewara; Arnd Bergmann; Dmitry Torokhov
Cc: Christoph Hellwig; pv-driv...@vmware.com; net...@vger.kernel.org; linux-ker...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; Pankaj Thakkar
Subject: Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3

On 5/5/10 10:29 AM, Dmitry Torokhov <d...@vmware.com> wrote:

> It would not be a binary blob but software properly released under the
> GPL. The current plan is for the shell to enforce the GPL requirement
> on the plugin code, similar to what the module loader does for regular
> kernel modules.

On 5/5/10 3:05 PM, Shreyas Bhatewara <sbhatew...@vmware.com> wrote:

> The plugin image is not linked against the Linux kernel. It is in fact
> OS-agnostic (e.g. the same plugin works for Linux and Windows VMs).

Are there any issues with injecting the GPL-licensed plug-in into the Windows vmxnet3 NDIS driver?

-scott

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [GIT PULL] amended: first round of vhost-net enhancements for net-next
From: Michael S. Tsirkin <m...@redhat.com>
Date: Tue, 4 May 2010 14:21:01 +0300

> This is an amended pull request: I have rebased the tree to the
> correct patches. This has been through basic tests and seems to work
> fine here.
>
> The following tree includes a couple of enhancements that help
> vhost-net. Please pull them for net-next. Another set of patches is
> under debugging/testing and I hope to get them ready in time for
> 2.6.35, so there may be another pull request later.

Pulled, thanks Michael.
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Rusty Russell wrote:
> On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
>> Jens Axboe wrote:
>>> On Tue, May 04 2010, Rusty Russell wrote:
>>>> ISTR someone mentioning a desire for such an API years ago, so
>>>> CC'ing the usual I/O suspects...
>>>
>>> It would be nice to have a fuller API for this, but the reality is
>>> that only the flush approach is really workable. Even just strict
>>> ordering of requests could only be supported on SCSI, and even there
>>> the kernel still lacks proper guarantees on error handling to
>>> prevent reordering.
>>
>> There are a few I/O scheduling differences that might be useful:
>>
>> 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
>> before a BARRIER. That might be useful for time-critical WRITEs, and
>> those issued with high I/O priority.
>
> This is only because no one actually wants flushes or barriers, though
> I/O people seem to only offer that. We really want "these writes must
> occur before this write". That offers maximum choice to the I/O
> subsystem and potentially to smart (virtual?) disks.

We do want flushes for the D in ACID: such things as after receiving a mail, or a blog update into a database file (could be TDB), and confirming that to the sender, to have high confidence that the update won't disappear on system crash or power failure.

Less obviously, it's also needed for the C in ACID when more than one file is involved. C is about differently updated things staying consistent with each other.

For example, imagine you have a TDB file mapping Samba usernames to passwords, and another mapping Samba usernames to local usernames. (I don't know if you do this; it's just an illustration.) Renaming a Samba user involves updating both. Let's ignore transient transactional issues :-) and just think about what happens with per-file barriers and no sync, when a crash happens long after the updates, and before the system has written out all data and issued low-level cache flushes.

After restarting, due to lack of sync, the Samba username could be present in one file and not the other.

>> 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
>> only for data belonging to a particular file (e.g. fdatasync with no
>> file size change, even on btrfs if O_DIRECT was used for the writes
>> being committed). That would entail tagging FLUSHes and WRITEs with a
>> fs-specific identifier (such as inode number), opaque to the
>> scheduler, which only checks equality.
>
> This is closer. In userspace I'd be happy with "all prior writes to
> this struct file before all future writes". Even if the original
> guarantees were stronger (ie. inode basis). We currently implement
> transactions using 4 fsync/msync pairs:
>
>	write_recovery_data(fd);
>	fsync(fd); msync(mmap);
>	write_recovery_header(fd);
>	fsync(fd); msync(mmap);
>	overwrite_with_new_data(fd);
>	fsync(fd); msync(mmap);
>	remove_recovery_header(fd);
>	fsync(fd); msync(mmap);
>
> Yet we really only need ordering, not guarantees about it actually
> hitting disk before returning.

In other words, FLUSH can be more relaxed than BARRIER inside the kernel. It's ironic that we think of fsync as stronger than fbarrier outside the kernel :-)

> It's an implementation detail; barrier has less flexibility because it
> has less information about what is required. I'm saying I want to give
> you as much information as I can, even if you don't use it yet.

I agree, and I've started a few threads about it over the last couple of years.

An fsync_range() system call would be very easy to use and (most importantly) easy to understand. With optional flags to weaken it (into fdatasync, barrier without sync, sync without barrier, one-sided barrier, no low-level cache flush, etc.), it would be very versatile, and still easy to understand. With an AIO version, and another flag meaning "don't rush, just return when satisfied", I suspect it would be useful for the most demanding I/O apps.
-- Jamie
Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
On Thu, May 06, 2010 at 01:19:33AM -0700, Gleb Natapov wrote:
> The overhead of interpreting the bytecode the plugin is written in. Or
> are you saying the plugin is x86 assembly (32-bit or 64-bit, btw?) and
> other arches will have to have an in-kernel x86 emulator to use the
> plugin (like some of them had for vgabios)?

The plugin is x86 or x64 machine code. You write the plugin in C and compile it using gcc/ld to get the object file; we map only the relevant sections into the OS space. NPA is a way of enabling passthrough of SR-IOV NICs with live migration support on the ESX hypervisor, which runs only on x86/x64 hardware and supports only x86/x64 guest OSes. So we don't have to worry about other architectures. If the NPA approach needs to be extended and adopted by other hypervisors, then we will have to take care of that. Today we have two plugin images per VF (one for 32-bit, one for 64-bit).
Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
On Wed, May 05, 2010 at 10:47:10AM -0700, Pankaj Thakkar wrote:

>> Forget about the licensing. Loading binary blobs written to a shim
>> layer is a complete pain in the ass and totally unsupportable, and
>> also uninteresting because of the overhead.
>
> [PT] Why do you think it is unsupportable? How different is it from
> any module written against a well maintained interface? What overhead
> are you talking about?

We only support in-kernel drivers; everything else is subject to changes in the kernel API and ABI. What you are doing is basically introducing another wrapper layer that does not allow full access to the normal Linux API. People have tried this before and we're not willing to add it. Do a little research on Project UDI if you're curious.

>> (1) move the limited VF drivers directly into the kernel tree, talk
>> to them through a normal ops vector
>
> [PT] This assumes that all the VF drivers would always be available.

Yes, absolutely. Just as we assume that for every other driver.

> Also we have to support Windows, and our current design supports it
> nicely in an OS-agnostic manner.

And that's not something we care about at all. The Linux kernel has traditionally taken a very hostile position against cross-platform drivers, for reasons well explained on many occasions before.

>> (2) get rid of the whole shim crap and instead integrate the limited
>> VF driver with the full VF driver we already have, instead of
>> duplicating the code
>
> [PT] Having a full VF driver adds a lot of dependency on the guest VM
> and this is what NPA tries to avoid.

Yes, of course it does. It's a normal driver at that point, which is what it should have been from day one.

>> (3) don't make the PV to VF integration VMware-specific but also
>> provide an open reference implementation like virtio. We're not going
>> to add a massive amount of infrastructure that is not actually usable
>> in a free software stack.
>
> [PT] Today this is tied to the vmxnet3 device and is intended to work
> on the ESX hypervisor only (vmxnet3 works on the VMware hypervisor
> only). All the loading support is inside the ESX hypervisor. I am
> going to post the interface between the shell and the plugin soon and
> you can see that there is not a whole lot of dependency or
> infrastructure required from the Linux kernel. Please keep in mind
> that we don't use Linux as a hypervisor but as a guest VM.

But we use Linux as the hypervisor, too. So if you want to target a major infrastructure you might better make it available for that case.
Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
On Wed, May 05, 2010 at 10:52:53AM -0700, Stephen Hemminger wrote:
> Let me put it bluntly. Any design that allows external code to run in
> the kernel is not going to be accepted. Out-of-tree kernel modules are
> enough of a pain already; why do you expect the developers to add
> another interface?

Exactly. Until our friends at VMware get this basic fact it's useless to continue arguing.

Pankaj and Dmitry: you're free to waste your time on this, but it's not going to go anywhere until you address that fundamental problem. The first thing you need to fix in your architecture is to integrate the VF function code into the kernel tree, and we can work from there. Please post patches doing this if you want to resume the discussion.
Re: virtio: put last_used and last_avail index into ring itself.
On Thu, 6 May 2010 03:57:55 pm Michael S. Tsirkin wrote:
> On Thu, May 06, 2010 at 10:22:12AM +0930, Rusty Russell wrote:
>> On Wed, 5 May 2010 03:52:36 am Michael S. Tsirkin wrote:
>>> What do you think?
>>
>> I think everyone is settled on 128-byte cache lines for the
>> foreseeable future, so it's not really an issue.
>
> You mean with 64 bit descriptors we will be bouncing a cache line
> between host and guest, anyway?

I'm confused by this entire thread.

Descriptors are 16 bytes. They are at the start, so presumably aligned to cache boundaries. The available ring follows that at 2 bytes per entry, so it's also packed nicely into cachelines. Then there's padding to the page boundary. That puts us on a cacheline again for the used ring; also 2 bytes per entry.

I don't see how any change in layout could be more cache friendly?

Rusty.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
> On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
>> +	/* We publish the last-seen used index at the end of the
>> +	 * available ring. It is at the end for backwards compatibility. */
>> +	vr->last_used_idx = &(vr)->avail->ring[num];
>> +	/* Verify that last used index does not spill over the used ring. */
>> +	BUG_ON((void *)vr->last_used_idx +
>> +	       sizeof *vr->last_used_idx > (void *)vr->used);
>> }
>
> Shouldn't this be on its own cache line?

It's next to the available ring, because that's where the guest publishes its data. That whole page is guest-write, host-read. Putting it on a cacheline by itself would be a slight pessimization; the host cpu would have to get the last_used_idx cacheline and the avail descriptor cacheline every time. This way, they are sometimes the same cacheline.

Hope that clarifies,
Rusty.
Re: [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 6 May 2010 03:49:46 pm Michael S. Tsirkin wrote:
> Now, I also added an mb() in the guest between the read and the write
> so that the last used index write can not get ahead of the used index
> read. It does feel good to have it there, but I can not say why it's
> helpful. Works fine without it, but then these subtle races might be
> hard to trigger. What do you think?

I couldn't see that in the patch? I don't think it's necessary though, since the write of last_used depends on the read of used (and no platform we care about would reorder such a thing).

I'm reasonably happy, but we should write some convenient test for missing interrupts.

I'm thinking of a sender which does a loop: blast 1MB of UDP packets, then print the time and sleep(1). The receiver would print the time every 1MB of received data. The two times should almost exactly correspond. Assuming that the network doesn't overflow and lose stuff, this should identify any missing wakeups/interrupts (depending on the direction used).

Cheers,
Rusty.
[PATCH] virtio: initialize earlier
From: Stijn Tintel <st...@linux-ipv6.be>

Move initialization of the virtio framework before the initialization of mtd, so that block2mtd can be used on virtio-based block devices.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15644

Signed-off-by: Stijn Tintel <st...@linux-ipv6.be>
Signed-off-by: Rusty Russell <ru...@rustcorp.com.au>
Cc: sta...@kernel.org
---
 drivers/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 34f1e10..f42a030 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -17,6 +17,7 @@
 obj-$(CONFIG_SFI)		+= sfi/
 obj-$(CONFIG_PNP)		+= pnp/
 obj-$(CONFIG_ARM_AMBA)		+= amba/
+obj-$(CONFIG_VIRTIO)		+= virtio/
 obj-$(CONFIG_XEN)		+= xen/
 
 # regulators early, since some subsystems rely on them to initialize
@@ -108,7 +109,6 @@ obj-$(CONFIG_PPC_PS3)	+= ps3/
 obj-$(CONFIG_OF)		+= of/
 obj-$(CONFIG_SSB)		+= ssb/
 obj-$(CONFIG_VHOST_NET)	+= vhost/
-obj-$(CONFIG_VIRTIO)		+= virtio/
 obj-$(CONFIG_VLYNQ)		+= vlynq/
 obj-$(CONFIG_STAGING)		+= staging/
 obj-y				+= platform/