Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Friday 09 April 2010, xiaohui@intel.com wrote:
> From: Xin Xiaohui <xiaohui@intel.com>
>
> Add a device to utilize the vhost-net backend driver for copy-less
> data transfer between guest FE and host NIC. It pins the guest user
> space to the host memory and provides proto_ops as sendmsg/recvmsg
> to vhost-net.

Sorry for taking so long before finding the time to look at your code
in more detail.

It seems that you are duplicating a lot of functionality that is
already in macvtap. I've asked about this before, but then didn't look
at your newer versions. Can you explain the value of introducing
another interface to user land?

I'm still planning to add zero-copy support to macvtap, hopefully
reusing parts of your code, but do you think there is value in having
both?

> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> new file mode 100644
> index 000..86d2525
> --- /dev/null
> +++ b/drivers/vhost/mpassthru.c
> @@ -0,0 +1,1264 @@
> +
> +#ifdef MPASSTHRU_DEBUG
> +static int debug;
> +
> +#define DBG  if (mp->debug) printk
> +#define DBG1 if (debug == 2) printk
> +#else
> +#define DBG(a...)
> +#define DBG1(a...)
> +#endif

This should probably just use the existing dev_dbg/pr_debug
infrastructure.

[... skipping buffer management code for now]

> +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
> +                      struct msghdr *m, size_t total_len)
> +{
[...]

This function looks like we should be able to easily include it into
macvtap and get zero-copy transmits without introducing the new
user-level interface.

> +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
> +                      struct msghdr *m, size_t total_len,
> +                      int flags)
> +{
> +        struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +        struct page_ctor *ctor;
> +        struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);

It smells like a layering violation to look at the iocb->private field
from a lower-level driver.
I would have hoped that it's possible to implement this without having
this driver know about the higher-level vhost driver internals. Can you
explain why this is needed?

> +        spin_lock_irqsave(&ctor->read_lock, flag);
> +        list_add_tail(&info->list, &ctor->readq);
> +        spin_unlock_irqrestore(&ctor->read_lock, flag);
> +
> +        if (!vq->receiver) {
> +                vq->receiver = mp_recvmsg_notify;
> +                set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
> +                                   vq->num * 4096,
> +                                   vq->num * 4096);
> +        }
> +
> +        return 0;
> +}

Not sure what I'm missing, but who calls the vq->receiver? This seems
to be neither in the upstream version of vhost nor introduced by your
patch.

> +static void __mp_detach(struct mp_struct *mp)
> +{
> +        mp->mfile = NULL;
> +
> +        mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
> +        page_ctor_detach(mp);
> +        mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> +
> +        /* Drop the extra count on the net device */
> +        dev_put(mp->dev);
> +}
> +
> +static DEFINE_MUTEX(mp_mutex);
> +
> +static void mp_detach(struct mp_struct *mp)
> +{
> +        mutex_lock(&mp_mutex);
> +        __mp_detach(mp);
> +        mutex_unlock(&mp_mutex);
> +}
> +
> +static void mp_put(struct mp_file *mfile)
> +{
> +        if (atomic_dec_and_test(&mfile->count))
> +                mp_detach(mfile->mp);
> +}
> +
> +static int mp_release(struct socket *sock)
> +{
> +        struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +        struct mp_file *mfile = mp->mfile;
> +
> +        mp_put(mfile);
> +        sock_put(mp->socket.sk);
> +        put_net(mfile->net);
> +
> +        return 0;
> +}

Doesn't this prevent the underlying interface from going away while the
chardev is open? You also have logic to handle that case, so why do you
keep the extra reference on the netdev?

> +/* Ops structure to mimic raw sockets with mp device */
> +static const struct proto_ops mp_socket_ops = {
> +        .sendmsg = mp_sendmsg,
> +        .recvmsg = mp_recvmsg,
> +        .release = mp_release,
> +};

> +static int mp_chr_open(struct inode *inode, struct file * file)
> +{
> +        struct mp_file *mfile;
> +        cycle_kernel_lock();

I don't think you really want to use the BKL here, just kill that line.
> +static long mp_chr_ioctl(struct file *file, unsigned int cmd,
> +                         unsigned long arg)
> +{
> +        struct mp_file *mfile = file->private_data;
> +        struct mp_struct *mp;
> +        struct net_device *dev;
> +        void __user* argp = (void __user *)arg;
> +        struct ifreq ifr;
> +        struct sock *sk;
> +        int ret;
> +
> +        ret = -EINVAL;
> +
> +        switch (cmd) {
> +        case MPASSTHRU_BINDDEV:
> +                ret = -EFAULT;
> +                if (copy_from_user(&ifr, argp, sizeof ifr))
> +                        break;

This is broken for 32-bit compat mode ioctls, because struct ifreq is
different between 32 and 64 bit systems. Since you are only using the
device name anyway, a fixed-length string or just the interface index
would be simpler and work better.
Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 04:55:21PM +0200, Arnd Bergmann wrote:
> > On Friday 09 April 2010, xiaohui@intel.com wrote:
> > > From: Xin Xiaohui <xiaohui@intel.com>
> >
> > It seems that you are duplicating a lot of functionality that is
> > already in macvtap. I've asked about this before but then didn't
> > look at your newer versions. Can you explain the value of
> > introducing another interface to user land?
>
> Hmm, I have not noticed a lot of duplication.

The code is indeed quite distinct, but the idea of adding another
character device to pass into vhost for direct device access is.

> BTW macvtap also duplicates tun code, it might be a good idea for tun
> to export some functionality.

Yes, that's something I plan to look into.

> > I'm still planning to add zero-copy support to macvtap, hopefully
> > reusing parts of your code, but do you think there is value in
> > having both?
>
> If macvtap would get zero copy tx and rx, maybe not.

It's not immediately obvious whether zero-copy support for macvtap can
work, though, especially zero-copy rx. The approach with mpassthru is
much simpler in that it takes complete control of the device.

As far as I can tell, the most significant limitation of mpassthru is
that there can only ever be a single guest on a physical NIC. Given
that limitation, I believe we can do the same on macvtap, and simply
disable zero-copy rx when you want to use more than one guest, or both
guest and host on the same NIC.

The logical next step here would be to allow VMDq and similar
technologies to separate out the rx traffic in the hardware. We don't
have a configuration interface for that yet, but since this is
logically the same as macvlan, I think we should use the same
interfaces for both, essentially treating VMDq as a hardware
acceleration for macvlan. We can probably handle it in similar ways to
how we handle hardware support for vlan.
At that stage, macvtap would be the logical interface for connecting a
VMDq (hardware macvlan) device to a guest!

> > +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> > +                                unsigned long count, loff_t pos)
> > +{
> > +        struct file *file = iocb->ki_filp;
> > +        struct mp_struct *mp = mp_get(file->private_data);
> > +        struct sock *sk = mp->socket.sk;
> > +        struct sk_buff *skb;
> > +        int len, err;
> > +        ssize_t result;
> >
> > Can you explain what this function is even there for? AFAICT,
> > vhost-net doesn't call it, the interface is incompatible with the
> > existing tap interface, and you don't provide a read function.
>
> qemu needs the ability to inject raw packets into the device from
> userspace, bypassing vhost/virtio (for live migration).

Ok, but since there is only a write callback and no read, it won't
actually be able to do this with the current code, right?

Moreover, it seems weird to have a new type of interface here that
duplicates tap/macvtap with less functionality. Coming back to your
original comment, this means that while mpassthru is currently not
duplicating the actual code from macvtap, it would need to do exactly
that to get the qemu interface right!

	Arnd

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > qemu needs the ability to inject raw packets into the device from
> > > userspace, bypassing vhost/virtio (for live migration).
> >
> > Ok, but since there is only a write callback and no read, it won't
> > actually be able to do this with the current code, right?
>
> I think it'll work as is: with vhost, qemu only ever writes, never
> reads from the device. We'll also never need GSO etc., which is a
> large part of what tap does (and macvtap will have to do).

Ah, I see. I didn't realize that qemu needs to write to the device even
if vhost is used. But for the case of migration to another machine
without vhost, wouldn't qemu also need to read?

> > Moreover, it seems weird to have a new type of interface here that
> > duplicates tap/macvtap with less functionality. Coming back to your
> > original comment, this means that while mpassthru is currently not
> > duplicating the actual code from macvtap, it would need to do
> > exactly that to get the qemu interface right!
>
> I don't think so, see above. Anyway, both can reuse tun.c :)

There is one significant difference between macvtap/mpassthru and
tun/tap in that the directions are reversed: while macvtap and
mpassthru forward data from write() into dev_queue_xmit and from
skb_receive into read(), tun/tap forwards data from write() into
skb_receive and from start_xmit into read().

Also, I'm not really objecting to duplicating code between macvtap and
mpassthru, as the implementation can always be merged. My main
objection is instead to having two different _user_interfaces_ for
doing the same thing.

	Arnd
Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Wednesday 14 April 2010 22:31:42 Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 06:35:57PM +0200, Arnd Bergmann wrote:
> > On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > > > qemu needs the ability to inject raw packets into the device
> > > > > from userspace, bypassing vhost/virtio (for live migration).
> > > >
> > > > Ok, but since there is only a write callback and no read, it
> > > > won't actually be able to do this with the current code, right?
> > >
> > > I think it'll work as is: with vhost, qemu only ever writes,
> > > never reads from the device. We'll also never need GSO etc.,
> > > which is a large part of what tap does (and macvtap will have
> > > to do).
> >
> > Ah, I see. I didn't realize that qemu needs to write to the device
> > even if vhost is used. But for the case of migration to another
> > machine without vhost, wouldn't qemu also need to read?
>
> Not that I know. Why?

Well, if the guest not only wants to send data but also receive frames
coming from other machines, they need to get from the kernel into qemu,
and the only way I can see for doing that is to read from this device
if there is no vhost support around on the new machine. Maybe we're
talking about different things here.

	Arnd
Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Wednesday 14 April 2010 22:40:03 Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 10:39:49PM +0200, Arnd Bergmann wrote:
> > Well, if the guest not only wants to send data but also receive
> > frames coming from other machines, they need to get from the kernel
> > into qemu, and the only way I can see for doing that is to read
> > from this device if there is no vhost support around on the new
> > machine. Maybe we're talking about different things here.
>
> mpassthru is currently useless without vhost. If the new machine has
> no vhost, it can't use mpassthru :)

Ok. Is that a planned feature though? vhost is currently limited to
guests with a virtio-net driver, and even if you extend it to other
guest emulations, it will probably always be a subset of the
qemu-supported drivers, but it may be useful to support zero-copy on
other drivers as well.

	Arnd
Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Thursday 15 April 2010, Xin, Xiaohui wrote:
> > It seems that you are duplicating a lot of functionality that is
> > already in macvtap. I've asked about this before but then didn't
> > look at your newer versions. Can you explain the value of
> > introducing another interface to user land?
> >
> > I'm still planning to add zero-copy support to macvtap, hopefully
> > reusing parts of your code, but do you think there is value in
> > having both?
>
> I have not looked into your macvtap code in detail before. Are the
> two interfaces exactly the same? We just want to create a simple way
> to do zero-copy. Now it can only support vhost, but in future we also
> want it to support direct read/write operations from user space too.

Right now, the features are mostly distinct. Macvtap first of all
provides a tap-style interface for users, and can also be used by
vhost-net. It also provides a way to share a NIC among a number of
guests in software, though I intend to add support for VMDq and SR-IOV
as well. Zero-copy is not yet done in macvtap but should be added.

mpassthru right now does not allow sharing a NIC between guests, and
does not have a tap interface for non-vhost operation, but does the
zero-copy that is missing from macvtap.

> Basically, compared to the interface, I'm more worried about the
> modifications to the net core we have made to implement zero-copy.
> If this hardest part can be done, then any user-space interface
> modifications or integrations are more easily done after that.

I agree that the network stack modifications are the hard part of
zero-copy, and your work on that looks very promising and is
complementary to what I've done with macvtap. Your current user
interface looks good for testing this out, but I think we should not
merge it (the interface) upstream if we can get the same or better
result by integrating your buffer management code into macvtap. I can
try to merge your code into macvtap myself if you agree, so you can
focus on getting the internals right.
> > Not sure what I'm missing, but who calls the vq->receiver? This
> > seems to be neither in the upstream version of vhost nor introduced
> > by your patch.
>
> See patch v3 2/3 I have sent out; it is called by handle_rx() in
> vhost.

Ok, I see. As a general rule, it's preferred to split a patch series in
a way that makes it possible to apply each patch separately and still
get a working kernel, ideally with more features than the version
before the patch. I believe you could get there by reordering your
patches to make the actual driver the last one in the series. Not a big
problem though, I was mostly looking in the wrong place.

> > > +        ifr.ifr_name[IFNAMSIZ-1] = '\0';
> > > +
> > > +        ret = -EBUSY;
> > > +
> > > +        if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
> > > +                break;
> >
> > Your current use of the IFF_MPASSTHRU* flags does not seem to make
> > any sense whatsoever. You check that this flag is never set, but
> > set it later yourself and then ignore all flags.
>
> Using that flag is to prevent another one from binding the same
> device again. But I will see if it really ignores all other flags.

The ifr variable is on the stack of the mp_chr_ioctl function, and you
never look at the value after setting it. In order to prevent multiple
opens of that device, you probably need to lock out any other users as
well, and make it a property of the underlying device. E.g. you also
want to prevent users on the host from setting an IP address on the
NIC and using it to send and receive data there.

	Arnd
Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps
On Friday 23 April 2010, Avi Kivity wrote:
> On 04/23/2010 01:20 PM, Alexander Graf wrote:
> > I would say the reason is that if we did not convert the user-space
> > pointer to a void *, kvm_get_dirty_log() would end up copying the
> > dirty log to (log->dirty_bitmap << 32) | 0x...
>
> Well yes, that was the problem. If we always set the __u64 value to
> the pointer we're safe though:
>
>         union {
>                 void *p;
>                 __u64 q;
>         };
>
>         void x(void *r)
>         {
>                 /* breaks: */
>                 p = r;
>                 /* works: */
>                 q = (ulong)r;
>         }
>
> In that case it's better to avoid p altogether, since users will
> naturally assign to the pointer.

Right.

> Using a 64-bit integer avoids the problem (though perhaps not
> sufficient for s390, Arnd?)

When there is only a __u64 for the address, it will work on s390 as
well; gcc is smart enough to clear the upper bit on a cast from long
to pointer.

The simple rule is to never put any 'long' or pointer into data
structures that you pass to an ioctl, and to add padding to multiples
of 64 bits to align the data structure for the x86 alignment problem.

	Arnd
Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps
On Friday 23 April 2010, Avi Kivity wrote:
> On 04/23/2010 03:27 PM, Arnd Bergmann wrote:
> > When there is only a __u64 for the address, it will work on s390 as
> > well; gcc is smart enough to clear the upper bit on a cast from
> > long to pointer.
>
> Ah, much more convenient than compat_ioctl. I assume it is part of
> the ABI, not a gcc-ism?

I don't think it's part of the ABI, but it's required to guarantee
that code like this works:

        int compare_pointer(void *a, void *b)
        {
                unsigned long ai = (unsigned long)a, bi = (unsigned long)b;

                return ai == bi; /* true if a and b point to the same object */
        }

We certainly rely on this already.

	Arnd
Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps
On Friday 23 April 2010, Avi Kivity wrote:
> Ah, so the 31st bit is optional as far as userspace is concerned?
> What does it mean? (just curious)

On data pointers it's ignored. When you branch to a function, this bit
determines whether the target function is run in 24 or 31 bit mode.
This allows linking to legacy code on older operating systems that
also support 24 bit libraries.

> What happens on the opposite conversion? Is it restored? What about
>
>         int compare_pointer(void *a, void *b)
>         {
>                 unsigned long ai = (unsigned long)a;
>                 void *aia = (void *)ai;
>
>                 return a == b; /* true if a and b point to the same object */
>         }

Some instructions set the bit, others clear it, so aia and a may not
be bitwise identical.

> Does gcc mask the bit in pointer comparisons as well?

Yes. To stay in the above example:

        a == aia;                                     /* true */
        (unsigned long)a == (unsigned long)aia;       /* true */
        *(unsigned long *)a == *(unsigned long *)aia; /* undefined on s390 */

	Arnd
Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro
On Tuesday 04 May 2010, Takuya Yoshikawa wrote:
> Although we can use *_le_bit() helpers to treat bitmaps as
> little-endian arranged, having the le bit offset calculation as a
> separate macro gives us more freedom.
>
> For example, KVM has le arranged dirty bitmaps for VGA and
> live-migration, and they are used in user space too. To avoid bitmap
> copies between kernel and user space, we want to update the bitmaps
> in user space directly. To achieve this, the le bit offset with
> *_user() functions helps us a lot.
>
> So let us use the le bit offset calculation part by defining it as a
> new macro: generic_le_bit_offset().

Does this work correctly if your user space is 32 bits (i.e. unsigned
long is a different size in user space and kernel) on both big- and
little-endian systems?

I'm not sure about all the details, but I think you cannot in general
share bitmaps between user space and kernel because of this.

	Arnd
Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro
On Monday 10 May 2010, Takuya Yoshikawa wrote:
> (2010/05/06 22:38), Arnd Bergmann wrote:
> > On Wednesday 05 May 2010, Takuya Yoshikawa wrote:
> > > There was a suggestion to propose set_le_bit_user() kind of
> > > macros. But what I thought was that these have the constraint you
> > > two explained and seemed to be a little bit specific to some
> > > area, like KVM. So I decided to propose just the offset
> > > calculation macro.
> >
> > I'm not sure I understand how this macro is going to be used,
> > though. If you are just using this in kernel space, that's fine,
> > please go for it.
>
> Yes, I'm just using it in kernel space: qemu has its own
> endian-related helpers. So if you allow us to place this macro in
> asm-generic/bitops/*, it will help us.

No problem at all then. Thanks for the explanation.

Acked-by: Arnd Bergmann <a...@arndb.de>
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Saturday 29 May 2010, Tom Lyon wrote:
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + * buffer may only be larger than 1 page if (a) there is
> + * an iommu in the system, or (b) buffer is part of a huge page
> + */
> +struct vfio_dma_map {
> +        __u64 vaddr;   /* process virtual addr */
> +        __u64 dmaaddr; /* desired and/or returned dma address */
> +        __u64 size;    /* size in bytes */
> +        int   rdwr;    /* bool: 0 for r/o; 1 for r/w */
> +};

Please add a 32-bit padding word at the end of this, otherwise the
size of the data structure is incompatible between 32-bit x86
applications and 64-bit kernels.

	Arnd
Re: vhost-net todo list
On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> vhost-net driver projects

I still think that list should include:

- UDP multicast socket support
- TCP socket support
- raw packet socket support for qemu (from Or Gerlitz)

If we have those, plus the tap support that is already on your list,
we can use vhost-net as a generic offload for the host networking in
qemu.

> projects involving the networking stack
>
> - export socket from tap so vhost can use it - working on it now
> - extend raw sockets to support GSO/checksum offloading, and teach
>   vhost to use that capability [one way to do this: virtio net
>   header support]; will allow working with e.g. macvlan

One thing I'm planning to work on is bridge support in macvlan,
together with VEPA-compliant operation, i.e. not sending back
multicast frames to the origin. I'll also keep looking into macvtap,
though that will be less important once you get the tap socket support
running.

	Arnd
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> Userspace on x86 maps a PCI region, uses it for communication with
> ppc? This might have portability issues.

On x86 it should work, but if the host is powerpc or similar, you
cannot reliably access PCI I/O memory through copy_tofrom_user but
have to use memcpy_toio/fromio or readl/writel calls, which don't work
on user pointers. Specifically on powerpc, copy_from_user cannot
access unaligned buffers if they are on an I/O mapping.

	Arnd
Re: vhost-net todo list
On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2009 at 04:52:40PM +0200, Arnd Bergmann wrote:
> > On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> > > vhost-net driver projects
> >
> > I still think that list should include
>
> Yea, why not. Go wild.
>
> > - UDP multicast socket support
> > - TCP socket support
>
> Switch to UDP unicast while we are at it? Tunneling raw packets over
> TCP looks wrong.

Well, TCP is what qemu supports right now, that's why I added it to
the list. We could add UDP unicast as yet another protocol in both
qemu and vhost-net if there is demand for it. The implementation
should be trivial based on the existing code paths.

> > One thing I'm planning to work on is bridge support in macvlan,
> > together with VEPA-compliant operation, i.e. not sending back
> > multicast frames to the origin.
>
> Is multicast filtering already there (i.e. only getting frames for
> groups you want)?

No, and I think this is less important, because the bridge code also
doesn't do this.

> > I'll also keep looking into macvtap, though that will be less
> > important once you get the tap socket support running.
>
> Not sure I see the connection. To get an equivalent to macvtap, what
> you need is tso etc. support in packet sockets, no?

I'm not worried about tso support here. One of the problems that raw
packet sockets have is the requirement for root permissions (e.g.
through libvirt). Tap sockets and macvtap both don't have this
limitation, so you can use them as a regular user without libvirt.

	Arnd
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote:
> > On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> > > Userspace on x86 maps a PCI region, uses it for communication
> > > with ppc? This might have portability issues.
> >
> > On x86 it should work, but if the host is powerpc or similar, you
> > cannot reliably access PCI I/O memory through copy_tofrom_user but
> > have to use memcpy_toio/fromio or readl/writel calls, which don't
> > work on user pointers. Specifically on powerpc, copy_from_user
> > cannot access unaligned buffers if they are on an I/O mapping.
>
> We are talking about doing this in userspace, not in kernel.

Ok, that's fine then. I thought the idea was to use the vhost_net
driver to access the user memory, which would be a really cute hack
otherwise, as you'd only need to provide the eventfds from a
hardware-specific driver and could use the regular virtio_net on the
other side.

	Arnd
Re: vhost-net todo list
On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> > No, I think this is less important, because the bridge code also
> > doesn't do this.
>
> True, but the reason might be that it is much harder in a bridge (you
> have to snoop multicast registrations). With macvlan you know which
> multicasts each device wants.

Right. It shouldn't be hard to do, and I'll probably get to that after
the other changes.

> > One of the problems that raw packet sockets have is the requirement
> > for root permissions (e.g. through libvirt). Tap sockets and
> > macvtap both don't have this limitation, so you can use them as a
> > regular user without libvirt.
>
> I don't see a huge difference here. If you are happy with the user
> being able to bypass filters in the host, just give her the
> CAP_NET_RAW capability. It does not have to be root.

Capabilities are nice in theory, but I've never seen them being used
effectively in practice, where it essentially comes down to some SUID
wrapper.

Also, I might not want to allow the user to open a random raw socket,
but only one on a specific downstream port of a macvlan interface, so
I can filter out the data from that respective MAC address in an
external switch. That scenario is probably not so relevant for KVM,
unless you consider the guest taking over the qemu host process a
valid security threat.

	Arnd
Re: vhost-net todo list
On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> > Also, I might not want to allow the user to open a random raw
> > socket, but only one on a specific downstream port of a macvlan
> > interface, so I can filter out the data from that respective MAC
> > address in an external switch.
>
> I agree. Maybe we can fix that for raw sockets; want me to add it to
> the list? :)

So far, I could not find any theoretical solution for how to fix this,
but if you think it can be done, it would be good to have it on the
list somewhere.

	Arnd
Re: vhost-net todo list
On Thursday 17 September 2009, Michael S. Tsirkin wrote:
> On Thu, Sep 17, 2009 at 01:30:00PM +0200, Arnd Bergmann wrote:
> > On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> > > > Also, I might not want to allow the user to open a random raw
> > > > socket, but only one on a specific downstream port of a macvlan
> > > > interface, so I can filter out the data from that respective
> > > > MAC address in an external switch.
> > >
> > > I agree. Maybe we can fix that for raw sockets; want me to add it
> > > to the list? :)
> >
> > So far, I could not find any theoretical solution for how to fix
> > this.
>
> What if the socket had a LOCKBIND ioctl after which you can not bind
> it to any other device? Then someone with the RAW capability can open
> the socket, bind it to a device and hand it to you. You can send
> packets but not switch to another device.

Could work, though I was hoping for a solution that does not depend on
a privileged task at run time to open the socket, as you have with
persistent tap devices or chardevs like macvtap that can have their
permissions set by udev.

	Arnd
Re: vhost-net todo list
On Thursday 17 September 2009, Michael S. Tsirkin wrote:
> Well, we could have a char device with an ioctl that gives you back a
> socket, or maybe even have it give you back a socket when you open
> it. Will that make you happy?

Well, that would put us in the exact same spot as the tun/tap driver
patch you're working on or my (still unfinished, I need to spend some
time on it again) macvtap driver. As I said, either one addresses the
problem but is unrelated to the raw socket interface.

	Arnd
Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
On Tuesday 22 September 2009, Michael S. Tsirkin wrote:
> > More importantly, when virtualization is used with multi-queue
> > NICs, the virtio-net NIC is a single-CPU bottleneck. The virtio-net
> > NIC should preserve the parallelism (lock free) using multiple
> > receive/transmit queues. The number of queues should equal the
> > number of CPUs.
>
> Yup, multiqueue virtio is on the todo list ;-) Note we'll need
> multiqueue tap for that to help.

My idea for that was to open multiple file descriptors to the same
macvtap device and let the kernel figure out the right thing to do
with that. You can do the same with raw packet sockets in the case of
vhost_net, but I wouldn't want to add more complexity to the tun/tap
driver for this.

	Arnd
Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
On Tuesday 22 September 2009, Stephen Hemminger wrote: My idea for that was to open multiple file descriptors to the same macvtap device and let the kernel figure out the right thing to do with that. You can do the same with raw packet sockets in the case of vhost_net, but I wouldn't want to add more complexity to the tun/tap driver for this. Or get tap out of the way entirely. The packets should not have to go out to user space at all (see veth) How does veth relate to that, do you mean vhost_net? With vhost_net, you could still open multiple sockets, only the access is in the kernel. Obviously, once it is all in the kernel, that could be done under the covers, but I think it would be cleaner to treat vhost_net purely as a way to bypass the syscalls for user space, with as little visible impact as possible otherwise. Arnd
Re: [Qemu-devel] Release plan for 0.12.0
On Thursday 08 October 2009, Anthony Liguori wrote: Jens Osterkamp wrote: On Wednesday 30 September 2009, Anthony Liguori wrote: Please add to this list and I'll collect it all and post it somewhere. What about Or Gerlitz' raw backend driver? I did not see it go in yet, or did I miss something? The patch seems to have not been updated after the initial posting and the first feedback cycle. I'm generally inclined to oppose the functionality as I don't think it offers any advantages over the existing backends. There are two reasons why I think this backend is important: - As an easy way to provide isolation between guests (private ethernet port aggregator, PEPA) and external enforcement of network privileges (virtual ethernet port aggregator, VEPA) using the macvlan subsystem. - As a counterpart to the vhost_net driver, providing an identical user interface with or without vhost_net acceleration in the kernel. Arnd
Re: [PATCH 00/27] Add KVM support for Book3s_64 (PPC64) hosts v5
On Wednesday 21 October 2009, Alexander Graf wrote: KVM for PowerPC only supports embedded cores at the moment. While it makes sense to virtualize on small machines, it's even more fun to do so on big boxes. So I figured we need KVM for PowerPC64 as well. This patchset implements KVM support for Book3s_64 hosts and guest support for Book3s_64 and G3/G4. To really make use of this, you also need a recent version of qemu. Don't want to apply patches? Get the git tree! $ git clone git://csgraf.de/kvm $ git checkout origin/ppc-v4 Whole series Acked-by: Arnd Bergmann a...@arndb.de Great work, Alex! Arnd
Re: vhost-net patches
On Monday 26 October 2009, Shirley Ma wrote: On Sun, 2009-10-25 at 11:11 +0200, Michael S. Tsirkin wrote: What is vnet0? That's a tap interface. I am binding a raw socket to a tap interface and it doesn't work. Is that supported? Is the tap device connected to a bridge as you'd normally do with qemu? That won't work because then the data you send to the socket will be queued at the /dev/tun chardev. You can probably connect it like this: qemu <-> vhost_net <-> vnet0 == /dev/tun <-> qemu To connect two guests. I've also used a bidirectional pipe before, to connect two tap interfaces to each other. However, if you want to connect to a bridge, the easier interface would be to use a veth pair, with one end on the bridge and the other end used for the packet socket. Arnd
Re: [PATCHv6 1/3] tun: export underlying socket
On Monday 02 November 2009, Michael S. Tsirkin wrote: Tun device looks similar to a packet socket in that both pass complete frames from/to userspace. This patch fills in enough fields in the socket underlying tun driver to support sendmsg/recvmsg operations, and message flags MSG_TRUNC and MSG_DONTWAIT, and exports access to this socket to modules. Regular read/write behaviour is unchanged. This way, code using raw sockets to inject packets into a physical device can support injecting packets into the host network stack almost without modification. First user of this interface will be the vhost virtualization accelerator. You mentioned before that you wanted to export the socket using some ioctl function returning an open file descriptor, which seemed to be a cleaner approach than this one. What was your reason for changing? index 3f5fd52..404abe0 100644 --- a/include/linux/if_tun.h +++ b/include/linux/if_tun.h @@ -86,4 +86,18 @@ struct tun_filter { __u8 addr[0][ETH_ALEN]; }; +#ifdef __KERNEL__ +#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE) +struct socket *tun_get_socket(struct file *); +#else +#include <linux/err.h> +#include <linux/errno.h> +struct file; +struct socket; +static inline struct socket *tun_get_socket(struct file *f) +{ + return ERR_PTR(-EINVAL); +} +#endif /* CONFIG_TUN */ +#endif /* __KERNEL__ */ #endif /* __IF_TUN_H */ Is this a leftover from testing? Exporting the function for !__KERNEL__ seems pointless. Arnd
Re: [PATCH 14/27] Add book3s_64 specific opcode emulation
On Tuesday 03 November 2009, Benjamin Herrenschmidt wrote: (Though glibc can be nasty, afaik it might load up optimized variants of some routines with hard wired cache line sizes based on the CPU type) You can also get applications with hand-coded cache optimizations that are even harder, if not impossible, to fix. Arnd
Re: [PATCHv6 1/3] tun: export underlying socket
On Tuesday 03 November 2009, Arnd Bergmann wrote: index 3f5fd52..404abe0 100644 --- a/include/linux/if_tun.h +++ b/include/linux/if_tun.h @@ -86,4 +86,18 @@ struct tun_filter { __u8 addr[0][ETH_ALEN]; }; +#ifdef __KERNEL__ +#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE) +struct socket *tun_get_socket(struct file *); +#else +#include <linux/err.h> +#include <linux/errno.h> +struct file; +struct socket; +static inline struct socket *tun_get_socket(struct file *f) +{ + return ERR_PTR(-EINVAL); +} +#endif /* CONFIG_TUN */ +#endif /* __KERNEL__ */ #endif /* __IF_TUN_H */ Is this a leftover from testing? Exporting the function for !__KERNEL__ seems pointless. Michael, you didn't reply to this comment and the code is still there in v8. Do you actually need this? What for? Arnd
Re: Installing kernel headers in kvm-kmod
On Thursday 10 December 2009, Avi Kivity wrote: Maybe even /usr/local/include/kvm-kmod-$version/, and a symlink /usr/local/include/kvm-kmod. Depends on how fine-grained you want to do the packaging. Most distributions split packages between code and development packages. The kvm-kmod code is the kernel module, so you want to be able to install it for multiple kernels simultaneously. Building the package only requires one version of the header and does not depend on the underlying kernel version, only on the version of the module, so it's reasonable to install only one version as the -dev package, and have a dependency in there to match the module version with the header version. The most complex setup would split the development package into one per kernel version and/or module version, plus an extra package for the module version containing only the symlink. I wouldn't go there. It may also be useful to do the equivalent of 'make headers_install' from the kernel, to remove all #ifdef __KERNEL__ sections and sparse annotations from the header files, but it should also work without that. Well, qemu.git needs __user removed. This one is taken care of by kvm_kmod in the sync script, though it would be cleaner to only do it for the installed version of the header, not for the one used to build kvm.ko. Arnd
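[Editor's note] A minimal sketch of the annotation stripping discussed above — removing sparse annotations such as __user from an exported header. The real kernel does this in scripts/headers_install.sh (together with unifdef for the #ifdef __KERNEL__ sections); this one-function version is only illustrative:

```python
import re

# Strip sparse annotations (__user, __force, __kernel) the way an
# installed/exported header would have them removed. Illustrative only;
# the kernel's headers_install also runs unifdef -U__KERNEL__.
def strip_annotations(header: str) -> str:
    return re.sub(r"__(user|force|kernel)\s*", "", header)

print(strip_annotations("int kvm_ioctl(void __user *arg);"))
# -> int kvm_ioctl(void *arg);
```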
Re: Installing kernel headers in kvm-kmod
On Thursday 10 December 2009 17:14:40 Jan Kiszka wrote: Avi Kivity wrote: On 12/10/2009 06:42 PM, Jan Kiszka wrote: I've just (forced-)pushed the simple version with /usr/include/kvm-kmod as destination. The user headers are now stored under usr/include in the kvm-kmod sources and installed from there. It's customary to install to /usr/local, not to /usr (qemu does the same). Right. Specifically, an install from source should go to /usr/local/include by default, while a distro package should override the path to go to /usr/include, which the current version easily allows. This also means that qemu will have to look in three places now, /usr/local/include/kvm-kmod, /usr/include/kvm-kmod and /usr/include. Adding /usr/local/include probably doesn't hurt but should not be necessary. Adjusted accordingly. Moreover, I only install the target arch's header now. Looks good now. Arnd
Re: Host-guest channel interface advice needed
On Wednesday 26 November 2008, Gleb Natapov wrote: The interfaces that are being considered are a netlink socket (only datagram semantics, linux specific), a new socket family or a character device with a different minor number for each channel. Which one better suits the purpose? Is there another kind of interface to consider? A new socket family looks like a good choice, but it would be nice to hear other opinions before starting to work on it. I think a socket and a pty both look reasonable here, but one important aspect IMHO is that you only need a new kernel driver for the guest if you just use the regular pty support or Unix domain sockets in the host. Obviously, there needs to be some control over permissions, as a guest must not be able to just open any socket or pty of the host, so a reasonable approach might be that the guest can only create a socket or pty that can be opened by the host, but not vice versa. Alternatively, you create the socket/pty in host userspace and then allow passing that down into the guest, which creates a virtio device from it. Arnd
Re: Does macvtap support host to guest communication?
On Monday 18 April 2011, Asias He wrote: Hi, folks I am trying to use qemu/qemu-kvm with macvtap using the following commands: # ip link add link eth0 name v0 type macvtap mode {vepa,bridge,private} # ip link set v0 address da:4e:17:88:42:b1 up # idx=`ip link show v0 | grep mtu | awk -F: '{print $1}'` # kvm -net nic,macaddr=da:4e:17:88:42:b1 -net tap,fd=3 -hda /home/asias/qemu-stuff/sid.img 3<>/dev/tap${idx} I found that the guest can access other hosts on the LAN except the host where the guest lives, and the host where the guest lives can not access the guest. My question is: Does macvtap support host (hypervisor host) to guest communication? You can communicate between macvtap and macvlan devices when they are in bridge mode, but these devices cannot communicate with clients that run on the underlying device. Just add a macvlan device to your hardware interface and use that in the host instead of running on the low-level device directly. The other option is to use a vepa enabled bridge, but these are relatively rare. Arnd
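[Editor's note] The `idx=` step above scrapes the interface index out of `ip link show` output so the matching /dev/tapN chardev can be opened. A sketch of that parsing step in Python (the sample output line is invented for illustration; the real format may include extra fields):

```python
# Extract the interface index from an `ip link show` line such as
# "8: v0@eth0: <BROADCAST,...> mtu 1500 ..." -- the index is the number
# before the first colon, and macvtap exposes its chardev as /dev/tap<index>.
def tap_chardev(ip_link_line: str) -> str:
    idx = int(ip_link_line.split(":", 1)[0])
    return f"/dev/tap{idx}"

line = "8: v0@eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue"
print(tap_chardev(line))  # -> /dev/tap8
```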
Re: Does macvtap support host to guest communication?
On Monday 18 April 2011, Asias He wrote: (1) Is it possible to add an interface to macvtap like /dev/net/tun, e.g., /dev/net/macvtap? Currently, it is hard to use macvtap programmatically. I decided against having a multiplexor device because it makes permission handling rather hard. One chardev per network interface makes it possible to handle permissions in multiuser setups. (2) Adding another macvlan device (e.g., macvlan0) to the hardware interface (e.g., eth0) and using it as the old eth0 makes the process of using macvtap complicated. One has to reconfigure the network. This is not optimal from the user perspective. Is it possible to leave the low-level device as is when using the macvtap device? Only in VEPA mode. Note that a similar restriction applies when using the bridge device, for the same technical reasons. Arnd
Re: Does macvtap support host to guest communication?
On Monday 18 April 2011, Ingo Molnar wrote: Only in VEPA mode. Note that a similar restriction applies when using the bridge device, for the same technical reasons. Just to sum things up, our goal is to allow the tools/kvm/ unprivileged tool to provide TCP connectivity to Linux guests transparently, with the following parameters: - the kvm tool runs unprivileged - as an ordinary user - without having to configure much (preferably zero configuration: without having to configure anything) on the guest Linux side - multiple guests should just work without interfering with each other - the kvm tool wants to be stateless - i.e. it does not want to allocate or manage host side devices - it just wants to provide the kind of TCP/IP connectivity host unprivileged user-space has, to the guest. The tool wants to be a generic tool with no global state, not a daemon. So it wants to be a stateless, unprivileged and zero-configuration solution. Is this possible with macvtap, and if yes, what kind of macvtap mode and usage would you recommend for that goal? With the above requirements, I would suggest using something like the qemu user networking. This is slower and does not allow servers to be present in the guest, but those are not your goals as it seems. The primary goals of macvtap are to allow efficient networking (zero-copy, multi-queue, although we're not completely there yet) and proper security abstractions. If you want a guest to appear on the same network as the host, you can not do that without privileges to manage the host network setup, and I guess that will have to stay that way. Arnd
Re: Does macvtap support host to guest communication?
On Monday 18 April 2011, Asias He wrote: We do need guest appearing on the same network as the host support as well. The reason I am considering using macvtap instead of tap plus brctl is that it simplifies the bridge configuration and it is more efficient. Right, you certainly don't need to consider tap/brctl any more. However, IMHO, the interface of macvtap is not user friendly, at least for me. I have no idea about the technical reasons that make the low-level device inaccessible. But if it were accessible, a lot of configuration could be eliminated. I know virtualbox's bridge mode has this kind of restriction, while VMware's bridge mode does not. The main reason is that having a MAC address scan in the regular networking core would make the common TX case where there is no macvlan device more complex. Macvtap is derived from the plain macvlan driver, which used to support only sending out to the wire until I added the optional bridge mode. If you want a regular device to be able to send to a macvlan port, that would require at least these changes: * Add an option to put a plain device into macvlan-bridge mode * Add support for that option into iproute2 * Add a hook into dev_queue_xmit() to check for macvlan ports Arnd
Re: Does macvtap support host to guest communication?
On Monday 18 April 2011, Asias He wrote: If you want a regular device to be able to send to a macvlan port, that would require at least these changes: * Add an option to put a plain device into macvlan-bridge mode * Add support for that option into iproute2 * Add a hook into dev_queue_xmit() to check for macvlan ports Cool! Arnd, mind to add this feature to macvtap? No, not after I just explained why I haven't done it before and why it's so controversial. Also, I have moved on to other projects and am no longer doing active development of the macvtap driver. I'd be happy to pass on the ownership to someone else and help him or her extend it. Arnd
Re: [PATCH V2] VFIO driver: Non-privileged user level PCI drivers
On Tuesday 08 June 2010, Randy Dunlap wrote: Documentation/ioctl/ioctl-number.txt |1 Documentation/vfio.txt | 177 +++ MAINTAINERS |7 drivers/Kconfig |2 drivers/Makefile |1 drivers/vfio/Kconfig | 18 drivers/vfio/Makefile|6 drivers/vfio/uiommu.c| 126 + drivers/vfio/vfio_dma.c | 324 drivers/vfio/vfio_intrs.c| 191 +++ drivers/vfio/vfio_main.c | 624 + drivers/vfio/vfio_pci_config.c | 554 ++ drivers/vfio/vfio_rdwr.c | 147 + drivers/vfio/vfio_sysfs.c| 153 ++ include/linux/uiommu.h | 62 ++ include/linux/vfio.h | 200 16 files changed, 2593 insertions(+) This seems to be missing a change to include/linux/Kbuild that adds vfio.h to the exported files. Without the export, you cannot use the definitions from user space programs unless they come with their own copy of the header. Arnd
Re: [Qemu-devel] Re: KVM call minutes for June 15
On Wednesday 16 June 2010, Markus Armbruster wrote: Can't hurt reviewer motivation. Could it be automated? Find replies, extract tags. If you want your acks to be picked up, you better make sure your References header works, and your tags are formatted correctly. I think pwclient (https://patchwork.kernel.org/) can do this for you. Arnd
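[Editor's note] The automation discussed here — scanning reply bodies for correctly formatted review tags — amounts to a line-anchored regex over each mail. A sketch (the sample reply text and names are invented; the tag format is the usual kernel "Acked-by:" convention):

```python
import re

# Match the standard review tags at the start of a line in a reply body,
# which is what patchwork-style tooling looks for.
TAG_RE = re.compile(r"^(Acked-by|Reviewed-by|Tested-by):\s*(.+)$", re.MULTILINE)

reply = """Looks good to me.

Reviewed-by: Jane Doe <jane@example.com>
Acked-by: John Smith <john@example.com>
"""

tags = TAG_RE.findall(reply)
print(tags)
# -> [('Reviewed-by', 'Jane Doe <jane@example.com>'), ('Acked-by', 'John Smith <john@example.com>')]
```

A tag buried mid-sentence would not match, which is exactly the point of requiring correct formatting.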
Re: kvm-s390: Dont exit SIE on SIGP sense running
On Monday 21 June 2010, Christian Borntraeger wrote: Hmm, don't know. Currently this calls into a s390 debug tracing facility (arch/s390/kernel/debug.c) which is heavily used by our service folks. There are commands for crash and lcrash to show these s390 debug traces from a dump. Maybe it's worth investigating whether we should change some of these events to have both ftrace tracepoints and the debug traces. I think that it would be worthwhile to convert the entire s390 debug code to become tracepoints, either one by one or making it a subclass with the existing interfaces. Arnd
Re: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
On Friday 30 July 2010 17:51:52 Shirley Ma wrote: On Fri, 2010-07-30 at 16:53 +0800, Xin, Xiaohui wrote: Since vhost-net already supports macvtap/tun backends, do you think it's better to implement zero copy in macvtap/tun than introducing a new media passthrough device here? I'm not sure if there will be more duplicated code in the kernel. I think there should be less duplicated code in the kernel if we use macvtap to support what the media passthrough driver does here. Since macvtap already supports the virtio_net header and offloading, the only missing functionality is zero copy. Also QEMU supports macvtap, we just need to add a zero copy flag as an option. Yes, I fully agree and that was one of the intended directions for macvtap to start with. Thank you so much for following up on that, I've long been planning to work on macvtap zero-copy myself but it's now lower on my priorities, so it's good to hear that you made progress on it, even if there are still performance issues. Arnd
Re: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
On Wednesday 04 August 2010, Dong, Eddie wrote: Arnd Bergmann wrote: On Friday 30 July 2010 17:51:52 Shirley Ma wrote: I think there should be less duplicated code in the kernel if we use macvtap to support what the media passthrough driver does here. Since macvtap already supports the virtio_net header and offloading, the only missing functionality is zero copy. Also QEMU supports macvtap, we just need to add a zero copy flag as an option. Yes, I fully agree and that was one of the intended directions for macvtap to start with. Thank you so much for following up on that, I've long been planning to work on macvtap zero-copy myself but it's now lower on my priorities, so it's good to hear that you made progress on it, even if there are still performance issues. But zero-copy is a generic Linux feature that can be used by other VMMs as well, if the BE service drivers want to incorporate it. If we can make the mp device VMM-agnostic (it may not be yet in the current patch), that will help Linux more. But the tun/tap protocol is what most hypervisors use today on Linux, and one of the design goals of macvtap was to keep that interface so that everyone gets features like zero-copy if that is added to macvtap. The mp device interface is currently not supported by anything other than vhost with these patches, and making it more generic would turn the interface into a copy of macvtap. Arnd
Re: [PATCH] Add definitions for current cpu models..
On Monday 18 January 2010, john cooper wrote: +.name = "Conroe", +.level = 2, +.vendor1 = CPUID_VENDOR_INTEL_1, +.vendor2 = CPUID_VENDOR_INTEL_2, +.vendor3 = CPUID_VENDOR_INTEL_3, +.family = 6, /* P6 */ +.model = 2, that looks wrong -- what is model 2 actually? +.stepping = 3, +.features = PPRO_FEATURES | +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | /* note 1 */ +CPUID_PSE36, /* note 2 */ +.ext_features = CPUID_EXT_SSE3 | CPUID_EXT_SSSE3, +.ext2_features = (PPRO_FEATURES & CPUID_EXT2_MASK) | +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, +.ext3_features = CPUID_EXT3_LAHF_LM, +.xlevel = 0x800A, +.model_id = "Intel Celeron_4x0 (Conroe/Merom Class Core 2)", +}, Celeron_4x0 is a rather bad example, because it is based on the single-core Conroe-L, which is family 6 / model 22 unlike all the dual- and quad-core Merom/Conroe that are model 15. +{ +.name = "Penryn", +.level = 2, +.vendor1 = CPUID_VENDOR_INTEL_1, +.vendor2 = CPUID_VENDOR_INTEL_2, +.vendor3 = CPUID_VENDOR_INTEL_3, +.family = 6, /* P6 */ +.model = 2, +.stepping = 3, +.features = PPRO_FEATURES | +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | /* note 1 */ +CPUID_PSE36, /* note 2 */ +.ext_features = CPUID_EXT_SSE3 | +CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41, +.ext2_features = (PPRO_FEATURES & CPUID_EXT2_MASK) | +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, +.ext3_features = CPUID_EXT3_LAHF_LM, +.xlevel = 0x800A, +.model_id = "Intel Core 2 Duo P9xxx (Penryn Class Core 2)", +}, This would be model 23 for Penryn-class Xeon/Core/Pentium/Celeron processors without L3 cache.
+{ +.name = "Nehalem", +.level = 2, +.vendor1 = CPUID_VENDOR_INTEL_1, +.vendor2 = CPUID_VENDOR_INTEL_2, +.vendor3 = CPUID_VENDOR_INTEL_3, +.family = 6, /* P6 */ +.model = 2, +.stepping = 3, +.features = PPRO_FEATURES | +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | /* note 1 */ +CPUID_PSE36, /* note 2 */ +.ext_features = CPUID_EXT_SSE3 | +CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41 | +CPUID_EXT_SSE42 | CPUID_EXT_POPCNT, +.ext2_features = (PPRO_FEATURES & CPUID_EXT2_MASK) | +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, +.ext3_features = CPUID_EXT3_LAHF_LM, +.xlevel = 0x800A, +.model_id = "Intel Core i7 9xx (Nehalem Class Core i7)", +}, Apparently, not all the i7-9xx CPUs are Nehalem, the i7-980X is supposed to be Westmere, which has more features. Because of the complexity, I'd recommend passing down the *model* number of the emulated CPU, the interesting Intel ones (those supported by KVM) being: 15-6: CedarMill/Presler/Dempsey/Tulsa (Pentium 4/Pentium D/Xeon 50xx/Xeon 71xx) 6-14: Yonah/Sossaman (Celeron M4xx, Core Solo/Duo, Pentium Dual-Core T1000, Xeon ULV) 6-15: Merom/Conroe/Kentsfield/Woodcrest/Clovertown/Tigerton (Celeron M5xx/E1xxx/T1xxx, Pentium T2xxx/T3xxx/E2xxx, Core 2 Solo U2xxx, Core 2 Duo E4xxx/E6xxx/Q6xxx/T5xxx/T7xxx/L7xxx/U7xxx/SP7xxx, Xeon 30xx/32xx/51xx/52xx/72xx/73xx) 6-23: Penryn/Wolfdale/Yorkfield/Harpertown (Celeron 7xx/9xx/SU2xxx/T3xxx/E3xxx, Pentium T4xxx/SU2xxx/SU4xxx/E5xxx/E6xxx, Core 2 Solo SU3xxx, Core 2 Duo P/SU/T6xxx/x8xxx/x9xxx, Xeon 31xx/33xx/52xx/54xx) 6-26: Gainestown/Bloomfield (Xeon 35xx/55xx, Core i7-9xx) 6-28: Atom 6-29: Dunnington (Xeon 74xx) 6-30: Lynnfield/Clarksfield/JasperForest (Xeon 34xx, Core i7-8xx, Core i7-xxxQM, Core i5-7xx) 6-37: Arrandale/Clarkdale (Dual-Core Core i3/i5/i7) 6-44: Gulftown (six-core) Arnd
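[Editor's note] The family/model pairs listed in the mail above amount to a lookup table from CPUID values to microarchitecture names. A sketch with a small, illustrative subset of those entries (Penryn-class parts report family 6 / model 23, as the review comment on the Penryn definition notes):

```python
# Map Intel CPUID (family, model) pairs to the codenames listed in the
# mail; only a few of the entries are reproduced here for illustration.
INTEL_CODENAMES = {
    (15, 6): "CedarMill/Presler/Dempsey/Tulsa",
    (6, 14): "Yonah/Sossaman",
    (6, 15): "Merom/Conroe/Kentsfield/Woodcrest/Clovertown/Tigerton",
    (6, 23): "Penryn/Wolfdale/Yorkfield/Harpertown",
    (6, 26): "Gainestown/Bloomfield",
    (6, 28): "Atom",
    (6, 44): "Gulftown",
}

def codename(family: int, model: int) -> str:
    return INTEL_CODENAMES.get((family, model), "unknown")

print(codename(6, 26))  # -> Gainestown/Bloomfield
```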
Re: [Qemu-devel] Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wednesday 27 January 2010, Anthony Liguori wrote: The raw backend can be attached to a physical device This is equivalent to bridging with tun/tap except that it has the unexpected behaviour of unreliable host/guest networking (which is not universally consistent across platforms either). This is not a mode we want to encourage users to use. It's not the most common scenario, but I've seen systems (I remember one on s/390 with z/VM) where you really want to isolate the guest network as much as possible from the host network. Besides PCI passthrough, giving the host device to a guest using a raw socket is the next best approximation of that. Then again, macvtap will do that too, if the device driver supports multiple unicast MAC addresses without forcing promiscuous mode. , macvlan macvtap is a superior way to achieve this use case because a macvtap fd can safely be given to a lesser privileged process without allowing escalation of privileges. Yes. or SR-IOV VF. This depends on vhost-net. Why? I don't see anything in this scenario that is vhost-net specific. I also plan to cover this aspect in macvtap in the future, but the current code does not do it yet. It also requires device driver changes. In general, what I would like to see for this is something more user friendly that dealt specifically with this use-case. Although honestly, given the recent security concerns around raw sockets, I'm very concerned about supporting raw sockets in qemu at all. Essentially, you get worse security doing vhost-net + raw + VF than with PCI passthrough + VF because at least in the latter case you can run qemu without privileges. CAP_NET_RAW is a very big privilege. It can be contained to a large degree with network namespaces. When you run qemu in its own namespace and add the VF to that, CAP_NET_RAW should ideally have no effect on other parts of the system (except bugs in the namespace implementation).
Arnd
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wednesday 27 January 2010, Michael S. Tsirkin wrote: I am not sure I agree with this sentiment. The main issue being that macvtap doesn't exist on all kernels :). macvlan also requires hardware support, packet socket can work with any network card in promisc mode. To be clear, macvlan does not require hardware support, it will happily put cards into promiscuous mode if they don't support multiple mac addresses. I agree to that. People don't even seem to agree whether it's a raw socket or a packet socket :) We need a better name for this option: what it really does is rely on an external device to loopback a packet to us, so how about -net loopback or -net extbridge? I think -net socket,fd should just be (trivially) extended to work with raw sockets out of the box, with no support for opening it. Then you can have libvirt or some wrapper open a raw socket and a private namespace and just pass it down. If you really want to let qemu open the socket itself, -net socket,raw=eth0 is probably closer to what you want than a new -net xxx option. Arnd
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wednesday 27 January 2010, Anthony Liguori wrote: I think -net socket,fd should just be (trivially) extended to work with raw sockets out of the box, with no support for opening it. Then you can have libvirt or some wrapper open a raw socket and a private namespace and just pass it down. That'd work. Anthony? The fundamental problem that I have with all of this is that we should not be introducing new network backends that are based around something only a developer is going to understand. If I'm a user and I want to use an external switch in VEPA mode, how in the world am I going to know that I'm supposed to use the -net raw backend or the -net socket backend? It might as well be the -net butterflies backend as far as a user is concerned. My point is that we already have -net socket,fd and any user that passes an fd into that already knows what he wants to do with it. Making it work with raw sockets is just a natural extension to this, which works on all kernels and (with separate namespaces) is reasonably secure. I fully agree that we should not introduce further network backends that would confuse users, but making the existing backends more flexible is something entirely different. Arnd
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wednesday 27 January 2010, Sridhar Samudrala wrote: On Wed, 2010-01-27 at 22:39 +0100, Arnd Bergmann wrote: On Wednesday 27 January 2010, Anthony Liguori wrote: I think -net socket,fd should just be (trivially) extended to work with raw sockets out of the box, with no support for opening it. Then you can have libvirt or some wrapper open a raw socket and a private namespace and just pass it down. That'd work. Anthony? The fundamental problem that I have with all of this is that we should not be introducing new network backends that are based around something only a developer is going to understand. If I'm a user and I want to use an external switch in VEPA mode, how in the world am I going to know that I'm supposed to use the -net raw backend or the -net socket backend? It might as well be the -net butterflies backend as far as a user is concerned. My point is that we already have -net socket,fd and any user that passes an fd into that already knows what he wants to do with it. Making it work with raw sockets is just a natural extension to this, which works on all kernels and (with separate namespaces) is reasonably secure. Didn't realize that -net socket is already there and supports TCP and UDP sockets. I will look into extending -net socket to support AF_PACKET SOCK_RAW type sockets. Actually, Jens had a patch doing this in early 2009 already but we decided to not send that one out at the time after Or had sent his version of the raw socket interface, which was a superset. Maybe Jens can post his patch again if that still applies? Arnd
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wednesday 27 January 2010, Anthony Liguori wrote: Introducing something that is known to be problematic from a security perspective without any clear idea of what the use-case for it is is a bad idea IMHO. vepa on existing kernels is one use-case. Considering VEPA enabled hardware doesn't exist today and the standards aren't even finished being defined, I don't think it's a really strong use case ;-) The hairpin turn (the part that is required on the bridge) was implemented in the Linux bridge in 2.6.32, so that is one existing implementation you can use as a peer. The VEPA mode in macvlan only made it into 2.6.33, so using the raw socket on older kernels does not give you actual VEPA semantics. The part of the standard that is still under discussion is the management side, which is almost entirely unrelated to this question though. With Linux-2.6.33 on both sides using raw/macvlan and bridge respectively, you can have a working VEPA setup. The only thing missing is that the hypervisor will not be able to tell the bridge to automatically enable hairpin mode (you need to do that on the bridge on a per-port basis). Now, the most important use case I see for the raw socket interface in qemu is to get vhost-net and the qemu user implementation to support the same feature set. If you ask for a network setup involving a raw socket and vhost-net and the kernel can support raw sockets but for some reason fails to set up vhost-net, you should have a fallback that has the exact same semantics at a possibly significant performance loss. Arnd
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
On Monday 25 January 2010, Dor Laor wrote: x86 qemu64 x86 phenom x86 core2duo x86 kvm64 x86 qemu32 x86 coreduo x86 486 x86 pentium x86 pentium2 x86 pentium3 x86 athlon x86 n270 I think a really nice addition would be an autodetect option for those users (e.g. desktop) that know they do not want to migrate the guest to a lower-spec machine. That option IMHO should just show up as identical to the host cpu, with the exception of features that are not supported in the guest. Arnd
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Thursday 28 January 2010, Anthony Liguori wrote: normal user uses libvirt to launch custom qemu instance. libvirt passes an fd of a raw socket to qemu and puts the qemu process in a restricted network namespace. user has another program running listening on a unix domain socket and does something to the qemu process that causes it to open the domain socket and send the fd it received from libvirt via SCM_RIGHTS. I looked at the af_unix code and it seems to suggest that this is not possible, because you cannot bind to a socket that belongs to a different network namespace. I haven't tried it though, so I may have missed something. Arnd
Re: [PATCH 0/3] Provide a zero-copy method on KVM virtio-net.
On Wednesday 10 February 2010, Xin Xiaohui wrote: The idea is simple, just to pin the guest VM user space and then let the host NIC driver have the chance to directly DMA to it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface in the host side to get copyless data transfer thru the guest virtio-net frontend. We provide multiple submits and asynchronous notification to vhost-net too. This does a lot of things that I had planned for macvtap. It's great to hear that you have made this much progress. However, I'd hope that we could combine this with the macvtap driver, which would give us zero-copy transfer capability both with and without vhost, as well as (tx at least) when using multiple guests on a macvlan setup. For transmit, it should be fairly straightforward to hook up your zero-copy method and the vhost-net interface into the macvtap driver. You have simplified the receive path significantly by assuming that the entire netdev can receive into a single guest, right? I'm assuming that the idea is to allow VMDq adapters to simply show up as separate adapters and have the driver handle this in a hardware specific way. My plan for this was to instead move support for VMDq into the macvlan driver so we can transparently use VMDq on hardware where available, including zero-copy receives, but fall back to software operation on non-VMDq hardware. Arnd
Re: [PATCH 0/3] Provide a zero-copy method on KVM virtio-net.
On Thursday 11 February 2010, Xin, Xiaohui wrote: This does a lot of things that I had planned for macvtap. It's great to hear that you have made this much progress. However, I'd hope that we could combine this with the macvtap driver, which would give us zero-copy transfer capability both with and without vhost, as well as (tx at least) when using multiple guests on a macvlan setup. You mean the zero-copy can work with macvtap driver without vhost. May you give me some detailed info about your macvtap driver and the relationship between vhost and macvtap to make me have a clear picture then? macvtap provides a user interface that is largely compatible with the tun/tap driver, and can be used in place of that from qemu. Vhost-net currently interfaces with tun/tap, but not yet with macvtap, which is easy enough to add and already on my list. The underlying code is macvlan, which is a driver that virtualizes network adapters in software, giving you multiple net_device instances for a real NIC, each of them with their own MAC address. In order to do zero-copy transmit with macvtap, the idea is to add a nonblocking version of the aio_write() function that works a lot like your transmit function. For receive, the hardware does not currently know which guest is supposed to get any frame coming in from the outside. Adding zero-copy receive requires interaction with the device driver and hardware capabilities to separate traffic by inbound MAC address into separate buffers per VM. I'm assuming that the idea is to allow VMDq adapters to simply show up as separate adapters and have the driver handle this in a hardware specific way. Does the VMDq driver do so now? I don't think anyone has published a VMDq capable driver so far. I was just assuming that you were working on one. Arnd
Re: [PATCH] Inter-VM shared memory PCI device
On Monday 08 March 2010, Cam Macdonell wrote: enum ivshmem_registers { IntrMask = 0, IntrStatus = 2, Doorbell = 4, IVPosition = 6, IVLiveList = 8 }; The first two registers are the interrupt mask and status registers. Interrupts are triggered when a message is received on the guest's eventfd from another VM. Writing to the 'Doorbell' register is how synchronization messages are sent to other VMs. The IVPosition register is read-only and reports the guest's ID number. The IVLiveList register is also read-only and reports a bit vector of currently live VM IDs. The Doorbell register is 16-bits, but is treated as two 8-bit values. The upper 8-bits are used for the destination VM ID. The lower 8-bits are the value which will be written to the destination VM and what the guest status register will be set to when the interrupt is triggered in the destination guest. A value of 255 in the upper 8-bits will trigger a broadcast where the message will be sent to all other guests. This means you have at least two intercepts for each message: 1. Sender writes to doorbell 2. Receiver gets interrupted With optionally two more intercepts in order to avoid interrupting the receiver every time: 3. Receiver masks interrupt in order to process data 4. Receiver unmasks interrupt when it's done and status is no longer pending I believe you can do much better than this: combine status and mask bits, making this level triggered, and move to a bitmask of all guests. In order to send an interrupt to another guest, the sender first checks the bit for the receiver. If it's '1', no need for any intercept, the receiver will come back anyway. If it's zero, write a '1' bit, which gets OR'd into the bitmask by the host. The receiver gets interrupted at a rising edge and just leaves the bit on, until it's done processing, then turns the bit off by writing a '1' into its own location in the mask.
Arnd
Re: [PATCH] Inter-VM shared memory PCI device
On Tuesday 09 March 2010, Cam Macdonell wrote: We could make the masking in RAM, not in registers, like virtio, which would require no exits. It would then be part of the application specific protocol and out of scope of this spec. This kind of implementation would be possible now since with UIO it's up to the application whether to mask interrupts or not and what interrupts mean. We could leave the interrupt mask register for those who want that behaviour. Arnd's idea would remove the need for the Doorbell and Mask, but we will always need at least one MMIO register to send whatever interrupts we do send. You'd also have to be very careful if the notification is in RAM to avoid races between one guest triggering an interrupt and another guest clearing its interrupt mask. A totally different option that avoids this whole problem would be to separate the signalling from the shared memory, making the PCI shared memory device a trivial device with a single memory BAR, and using a higher-level concept like a virtio based serial line for the actual signalling. Arnd
Re: [PATCH] Inter-VM shared memory PCI device
On Thursday 11 March 2010, Avi Kivity wrote: A totally different option that avoids this whole problem would be to separate the signalling from the shared memory, making the PCI shared memory device a trivial device with a single memory BAR, and using a higher-level concept like a virtio based serial line for the actual signalling. That would be much slower. The current scheme allows for an ioeventfd/irqfd short circuit which allows one guest to interrupt another without involving their qemus at all. Yes, the serial line approach would be much slower, but my point was that we can do signaling over something else, which could well be something building on irqfd. Arnd
Re: [PATCH] Inter-VM shared memory PCI device
On Thursday 11 March 2010, Avi Kivity wrote: That would be much slower. The current scheme allows for an ioeventfd/irqfd short circuit which allows one guest to interrupt another without involving their qemus at all. Yes, the serial line approach would be much slower, but my point was that we can do signaling over something else, which could well be something building on irqfd. Well, we could, but it seems to make things more complicated? A card with shared memory, and another card with an interrupt interconnect? Yes, I agree that it's more complicated if you have a specific application in mind that needs one of each, and most use cases that want shared memory also need an interrupt mechanism, but it's not always the case: - You could use ext2 with -o xip on a private mapping of a shared host file in order to share the page cache. This does not need any interrupts. - If you have more than two parties sharing the segment, there are different ways to communicate, e.g. always send an interrupt to all others, or have dedicated point-to-point connections. There is also some complexity in trying to cover all possible cases in one driver. I have to say that I also really like the idea of futex over shared memory, which could potentially make this all a lot simpler. I don't know how this would best be implemented on the host though. Arnd
Re: copyless virtio net thoughts?
On Wednesday 18 February 2009, Rusty Russell wrote: 2) Direct NIC attachment This is particularly interesting with SR-IOV or other multiqueue nics, but for boutique cases or benchmarks, could be for normal NICs. So far I have some very sketched-out patches: for the attached nic dev_alloc_skb() gets an skb from the guest (which supplies them via some kind of AIO interface), and a branch in netif_receive_skb() which returned it to the guest. This bypasses all firewalling in the host though; we're basically having the guest process drive the NIC directly. If this is not passing the PCI device directly to the guest, but uses your concept, wouldn't it still be possible to use the firewalling in the host? You can always inspect the headers, drop the frame, etc without copying the whole frame at any point. When it gets to the point of actually giving the (real pf or sr-iov vf) to one guest, you really get to the point where you can't do local firewalling any more. 3) Direct interguest networking Anthony has been thinking here: vmsplice has already been mentioned. The idea of passing directly from one guest to another is an interesting one: using dma engines might be possible too. Again, host can't firewall this traffic. Simplest as a dedicated internal lan NIC, but we could theoretically do a fast-path for certain MAC addresses on a general guest NIC. Another option would be to use an SR-IOV adapter from multiple guests, with a virtual ethernet bridge in the adapter. This moves the overhead from the CPU to the bus and/or adapter, so it may or may not be a real benefit depending on the workload. Arnd
Re: copyless virtio net thoughts?
On Thursday 19 February 2009, Rusty Russell wrote: Not quite: I think PCI passthrough IMHO is the *wrong* way to do it: it makes migrate complicated (if not impossible), and requires emulation or the same NIC on the destination host. This would be the *host* seeing the virtual functions as multiple NICs, then the ability to attach a given NIC directly to a process. I guess what you mean then is what Intel calls VMDq, not SR-IOV. Eddie has some slides about this at http://docs.huihoo.com/kvm/kvmforum2008/kdf2008_7.pdf . The latest network cards support both operation modes, and it appears to me that there is a place for both. VMDq gives you the best performance without limiting flexibility, while SR-IOV performance in theory can be even better, but sacrificing a lot of flexibility and potentially local (guest-to-guest) performance. AFAICT, any card that supports SR-IOV should also allow a VMDq like model, as you describe. Arnd
Re: [PATCH] remove static declaration from wall clock version
On Thursday 26 February 2009, Glauber Costa wrote:

@@ -548,15 +548,13 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
 static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
 {
-	static int version;
+	int version = 1;
 	struct pvclock_wall_clock wc;
 	struct timespec now, sys, boot;
 
 	if (!wall_clock)
 		return;
 
-	version++;
-	kvm_write_guest(kvm, wall_clock, &version, sizeof(version));
 	/*

Doesn't this mean that kvm_write_guest now writes an uninitialized value to the guest? I think what you need here is a 'static atomic_t version;' so you can do an atomic_inc instead of the ++. Arnd
Re: [PATCH] remove static declaration from wall clock version
On Friday 27 February 2009, Glauber Costa wrote: Doesn't this mean that kvm_write_guest now writes an uninitialized value to the guest? No. If you look closely, it's now initialized to 1. Right, I didn't see that change at first. Arnd
Re: OT: Intel-Matrix for VT-capability?
On Friday 24 April 2009, Oliver Rath wrote: Hi, I'm looking for a way to see the VT capability of Intel processors by their name :-/ I.e. T7200 has VT, T3400 has not. Is there a rule in the naming scheme for seeing VT capability? Alternatively, does a matrix exist anywhere on the net for this? I'm tired of searching for VT capability for every new OEM Intel processor. On the Intel site a PDF table _did_ exist (not all processors, but most of the T-series), but it seems to be removed. http://en.wikipedia.org/wiki/List_of_Intel_Core_2_microprocessors is quite good here. As a rule of thumb, anything higher than 6000 will have VT, anything below 6000 will not. Interesting exceptions are Doesn't have VT: E7300, Q8200, Q8400, E8190 Does have VT: T5600, U2xxx, SU3xxx, Celeron 900 (?) May have VT[1]: T5500, Q8300, E7400, E7500, E5300, E5400 Interestingly, when you look at the price list, you will see that *all* processors that are not being obviously phased out (i.e. have the same or higher price as a superior model) and carry a Pentium or Core 2 name come with VT enabled. I think it's very unlikely that they will come out with anything new that does not run KVM. Arnd [1] http://www.heise.de/newsticker/Auch-billigere-Intel-Prozessoren-bald-mit-Virtualisierungsbefehlen--/meldung/136306 [2] http://www.intc.com/priceList.cfm
Re: OT: Intel-Matrix for VT-capability?
On Tuesday 28 April 2009, Marc Bevand wrote: As a rule of thumb, anything higher than 6000 will have VT, anything below 6000 will not. Interesting exceptions are Doesn't have VT: E7300, Q8200, Q8400, E8190 Does have VT: T5600, U2xxx, SU3xxx, Celeron 900 (?) May have VT[1]: T5500, Q8300, E7400, E7500, E5300, E5400 Interestingly, when you look at the price list, you will see that *all* processors that are not being obviously phased out (i.e. have the same or higher price as a superior model) and carry a Pentium or Core 2 name come with VT enabled. This is very wrong: - none of the Pentium, Celeron, Atom processors, even the latest ones, come with VT All Pentium and Celeron processors have numbers below 6000, so they fit in the rule of thumb I gave above. The Celeron 900 is listed as having VT on the processorfinder, but that also incorrectly lists it as having two cores, so who knows? - none of the Core 2 Duo E7xxx and Core 2 Quad Q8xxx support VT Be very careful into what you buy, check processorfinder.intel.com. Interestingly I found out that Intel will enable VT on a very small number of Core 2 and Pentium processors on June 12: http://www.tcmagazine.com/comments.php?shownews=25886 These are the ones I listed above as 'may have VT': Q8300, E7400, E7500, E5300, E5400. The link I gave was to another article mentioning these exact numbers. The E7300, Q8200 and Q8400 I listed as 'Doesn't have VT' are the remaining E7xxx and Q8xxx processors, which appear to be phased out (not sure about Q8400, which was announced at the same time as this news). Arnd
Re: [KVM timekeeping 30/35] IOCTL for setting TSC rate
On Friday 20 August 2010 19:56:20 Glauber Costa wrote:

@@ -675,6 +676,9 @@ struct kvm_clock_data {
 #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2)
 /* Available with KVM_CAP_PPC_GET_PVINFO */
 #define KVM_PPC_GET_PVINFO _IOW(KVMIO, 0xa1, struct kvm_ppc_pvinfo)
+/* Available with KVM_CAP_SET_TSC_RATE */
+#define KVM_X86_GET_TSC_RATE _IOR(KVMIO, 0xa2, __u32)
+#define KVM_X86_SET_TSC_RATE _IOW(KVMIO, 0xa3, __u32)

wrap this into a struct? I don't think that would improve the code. Generally, we try to *avoid* using structs in ioctl arguments, although KVM does have a precedent of using structs there. In fact, the code here could be simplified by using get_user/put_user on the simple argument, which would not be possible with a struct. Arnd
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
On Wednesday 08 September 2010, Krishna Kumar2 wrote: The new guest and qemu code work with old vhost-net, just with reduced performance, yes? Yes, I have tested new guest/qemu with old vhost but using #numtxqs=1 (or not passing any arguments at all to qemu to enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost, since vhost_net_set_backend in the unmodified vhost checks for boundary overflow. I have also tested running an unmodified guest with new vhost/qemu, but qemu should not specify numtxqs > 1. Can you live migrate a new guest from new-qemu/new-kernel to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel? If not, do we need to support all those cases? Arnd
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
On Wednesday 08 September 2010, Krishna Kumar2 wrote: On Wednesday 08 September 2010, Krishna Kumar2 wrote: The new guest and qemu code work with old vhost-net, just with reduced performance, yes? Yes, I have tested new guest/qemu with old vhost but using #numtxqs=1 (or not passing any arguments at all to qemu to enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost, since vhost_net_set_backend in the unmodified vhost checks for boundary overflow. I have also tested running an unmodified guest with new vhost/qemu, but qemu should not specify numtxqs > 1. Can you live migrate a new guest from new-qemu/new-kernel to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel? If not, do we need to support all those cases? I have not tried this, though I added some minimal code in virtio_net_load and virtio_net_save. I don't know what needs to be done exactly at this time. I forgot to put this in the Next steps list of things to do. I was mostly trying to find out if you think it should work or if there are specific reasons why it would not. E.g. when migrating to a machine that has an old qemu, the guest gets reduced to a single queue, but it's not clear to me how it can learn about this, or if it can get hidden by the outbound qemu. Arnd
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tuesday 14 September 2010, Shirley Ma wrote: On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote: That's what io_submit() is for. Then io_getevents() tells you what a while actually was. This macvtap zero copy uses iov buffers from vhost ring, which is allocated from guest kernel. In host kernel, vhost calls macvtap sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers' pages for zero copy. The patch is relying on how vhost handle these buffers. I need to look at vhost code (qemu) first for addressing the questions here. I guess the best solution would be to make macvtap_aio_write return -EIOCBQUEUED when a packet gets passed down to the adapter, and call aio_complete when the adapter is done with it. This would change the regular behavior of macvtap into a model where every write on the file blocks until the packet has left the machine, which gives us better flow control, but does slow down the traffic when we only put one packet at a time into the queue. It also allows the user to call io_submit instead of write in order to do an asynchronous submission as Avi was suggesting. Arnd
Re: [PATCH v11 13/17] Add mp(mediate passthru) device.
On Tuesday 28 September 2010, Michael S. Tsirkin wrote:

+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, len);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+		kfree_skb(skb);
+		return -EAGAIN;
+	}
+
+	skb->protocol = eth_type_trans(skb, mp->dev);

Why are you calling eth_type_trans() on transmit? So that GSO can work. BTW macvtap does:

	skb_set_network_header(skb, ETH_HLEN);
	skb_reset_mac_header(skb);
	skb->protocol = eth_hdr(skb)->h_proto;

and I think this is broken for vlans. Arnd? Hmm, that code (besides set_network_header) was added by Sridhar for GSO support. I believe I originally did eth_type_trans but had to change it before that time because it broke something. Unfortunately, my memory on that is not very good any more. Can you be more specific what the problem is? Do you think it breaks when a guest sends VLAN tagged frames or when macvtap is connected to a VLAN interface that adds another tag (or only the combination)? Arnd
Re: [PATCH v11 13/17] Add mp(mediate passthru) device.
On Tuesday 28 September 2010, Michael S. Tsirkin wrote: On Tue, Sep 28, 2010 at 04:39:59PM +0200, Arnd Bergmann wrote: Can you be more specific what the problem is? Do you think it breaks when a guest sends VLAN tagged frames or when macvtap is connected to a VLAN interface that adds another tag (or only the combination)? I expect the protocol value to be wrong when guest sends vlan tagged frames as 802.1q frames have a different format. Ok, I see. Would that be fixed by using eth_type_trans()? I don't see any code in there that tries to deal with the VLAN tag, so do we have the same problem in the tun/tap driver? Also, I wonder how we handle the case where both the guest and the host do VLAN tagging. Does the host transparently override the guest tag, or does it add a nested tag? More importantly, what should it do? Arnd
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
On Tuesday 05 October 2010, Krishna Kumar2 wrote: After testing various combinations of #txqs, #vhosts, #netperf sessions, I think the drop for 1 stream is due to TX and RX for a flow being processed on different cpus. I did two more tests: 1. Pin vhosts to same CPU: - BW drop is much lower for 1 stream case (- 5 to -8% range) - But performance is not so high for more sessions. 2. Changed vhost to be single threaded: - No degradation for 1 session, and improvement for upto 8, sometimes 16 streams (5-12%). - BW degrades after that, all the way till 128 netperf sessions. - But overall CPU utilization improves. Summary of the entire run (for 1-128 sessions): txq=4: BW: (-2.3) CPU: (-16.5) RCPU: (-5.3) txq=16: BW: (-1.9) CPU: (-24.9) RCPU: (-9.6) I don't see any reasons mentioned above. However, for higher number of netperf sessions, I see a big increase in retransmissions:

	#netperf    ORG            NEW
	            BW (#retr)     BW (#retr)
	   1        70244 (0)      64102 (0)
	   4        21421 (0)      36570 (416)
	   8        21746 (0)      38604 (148)
	  16        21783 (0)      40632 (464)
	  32        22677 (0)      37163 (1053)
	  64        23648 (4)      36449 (2197)
	 128        23251 (2)      31676 (3185)

This smells like it could be related to a problem that Ben Greear found recently (see macvlan: Enable qdisc backoff logic). When the hardware is busy, we used to just drop the packet. With Ben's patch, we return -EAGAIN to qemu (or vhost-net) to trigger a resend. I suppose what we really should do is feed that condition back to the guest network stack and implement the backoff in there. Arnd
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
On Wednesday 06 October 2010 19:14:42 Krishna Kumar2 wrote: Arnd Bergmann a...@arndb.de wrote on 10/06/2010 05:49:00 PM: I don't see any reasons mentioned above. However, for higher number of netperf sessions, I see a big increase in retransmissions:

	#netperf    ORG            NEW
	            BW (#retr)     BW (#retr)
	   1        70244 (0)      64102 (0)
	   4        21421 (0)      36570 (416)
	   8        21746 (0)      38604 (148)
	  16        21783 (0)      40632 (464)
	  32        22677 (0)      37163 (1053)
	  64        23648 (4)      36449 (2197)
	 128        23251 (2)      31676 (3185)

This smells like it could be related to a problem that Ben Greear found recently (see macvlan: Enable qdisc backoff logic). When the hardware is busy, we used to just drop the packet. With Ben's patch, we return -EAGAIN to qemu (or vhost-net) to trigger a resend. I suppose what we really should do is feed that condition back to the guest network stack and implement the backoff in there. Thanks for the pointer. I will take a look at this as I hadn't seen this patch earlier. Is there any way to figure out if this is the issue? I think a good indication would be if this changes with/without the patch, and if you see -EAGAIN in qemu with the patch applied. Arnd
Re: [Qemu-devel] Re: [PATCH] pc: e820 qemu_cfg tables need to be packed
On Thursday 14 October 2010 21:58:08 Alex Williamson wrote: If it works anywhere (I assume it works on 32bit), then it's only because it happened to get the alignment right. This just makes 64bit hosts get it right too. I don't see any compatibility issues, non-packed + 64bit = broken. Thanks, I would actually assume that only x86-32 hosts got it right, because all 32 bit hosts I've seen other than x86 also define 8 byte alignment for uint64_t. You might however consider making it __attribute((__packed__, __aligned__(4))) instead of just packed, because otherwise you make the alignment one byte, which is not only different from what it used to be on x86-32 but also will cause inefficient compiler output on platforms that don't have unaligned word accesses in hardware. Arnd
Re: [Qemu-devel] Re: [PATCH] pc: e820 qemu_cfg tables need to be packed
On Thursday 14 October 2010 22:59:04 Alex Williamson wrote:
> The structs in question only contain 4 8-byte elements, so there
> shouldn't be any change on x86-32 using one-byte aligned packing.

I'm talking about the alignment of the structure, not the members
within the structure. The data structure should be compatible, but not
accesses to it.

> AFAIK, e820 is x86-only, so we don't need to worry about breaking
> anyone else.

You can use qemu to emulate an x86 pc on anything...

> Performance isn't much of a consideration for this type of interface
> since it's only used pre-boot. In fact, the channel between qemu and
> the bios is only one byte wide, so wider alignment can cost extra
> emulated I/O accesses.

Right, the data gets passed as bytes, so it hardly matters in the end.
Still, e820_add_entry assigns data to the struct members, which it
either does using byte accesses and shifts or multiple 32 bit
assignments.

Just because using a one byte alignment technically results in correct
output doesn't make it the right solution. I don't care about the few
cycles of execution time or the few bytes you waste in this particular
case, but you are setting a wrong example by using smaller alignment
than necessary.

	Arnd
Re: [PATCH v2] pc: e820 qemu_cfg tables need to be packed
On Friday 15 October 2010, Alex Williamson wrote:
> We can't let the compiler define the alignment for qemu_cfg data.
>
> Signed-off-by: Alex Williamson alex.william...@redhat.com
> ---
> v2: Adjust alignment to help non-x86 hosts per Arnd's suggestion

Ok, looks good now. Thanks!

	Arnd
Re: TODO item: guest programmable mac/vlan filtering with macvtap
On Friday 15 October 2010, Michael S. Tsirkin wrote:
> On Thu, Oct 14, 2010 at 11:40:52PM +0200, Dragos Tatulea wrote:
> > Hi,
> >
> > I'm starting a thread related to the TODO item mentioned in the
> > subject. Currently still gathering info and trying to make kvm and
> > macvtap play nicely together. I have used this [1] guide to set it
> > up but qemu is still complaining about the PCI device address of the
> > virtio-net-pci. Tried with latest qemu. Am I missing something here?
> >
> > [1] - http://virt.kernelnewbies.org/MacVTap
>
> It really should be:
>
> 	-net nic,model=virtio,netdev=foo -netdev tap,id=foo
>
> Created account but still could not edit the wiki. Arnd, know why
> that is? Could you correct qemu command line pls?

I also have lost write access to the wiki, no idea what happened
there. I started the page, but it subsequently became protected.

We never added support for the qemu command line directly, the plan
was to do that using helper scripts. The only way to do it is to
redirect both input and output to the tap device, so you need to do

	-net nic,model=virtio,netdev=foo -netdev tap,id=foo,fd=3 3<>/dev/tapN

when starting from bash.

	Arnd
Re: [PATCH v2 02/22] bitops: rename generic little-endian bitops functions
On Thursday 21 October 2010, Akinobu Mita wrote:
> As a preparation for providing little-endian bitops for all
> architectures, this removes the generic_ prefix from the little-endian
> bitops function names in asm-generic/bitops/le.h.
>
> s/generic_find_next_le_bit/find_next_le_bit/
> s/generic_find_next_zero_le_bit/find_next_zero_le_bit/
> s/generic_find_first_zero_le_bit/find_first_zero_le_bit/
> s/generic___test_and_set_le_bit/__test_and_set_le_bit/
> s/generic___test_and_clear_le_bit/__test_and_clear_le_bit/
> s/generic_test_le_bit/test_le_bit/
> s/generic___set_le_bit/__set_le_bit/
> s/generic___clear_le_bit/__clear_le_bit/
> s/generic_test_and_set_le_bit/test_and_set_le_bit/
> s/generic_test_and_clear_le_bit/test_and_clear_le_bit/
>
> Signed-off-by: Akinobu Mita akinobu.m...@gmail.com
> Cc: Hans-Christian Egtvedt hans-christian.egtv...@atmel.com
> Cc: Geert Uytterhoeven ge...@linux-m68k.org
> Cc: Roman Zippel zip...@linux-m68k.org
> Cc: Andreas Schwab sch...@linux-m68k.org
> Cc: linux-m...@lists.linux-m68k.org
> Cc: Greg Ungerer g...@uclinux.org
> Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
> Cc: Paul Mackerras pau...@samba.org
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: Andy Grover andy.gro...@oracle.com
> Cc: rds-de...@oss.oracle.com
> Cc: David S. Miller da...@davemloft.net
> Cc: net...@vger.kernel.org
> Cc: Avi Kivity a...@redhat.com
> Cc: Marcelo Tosatti mtosa...@redhat.com
> Cc: kvm@vger.kernel.org

Acked-by: Arnd Bergmann a...@arndb.de
Re: [PATCH iproute2] Add passthru mode and support 'mode' parameter with macvtap devices
On Wednesday 27 October 2010, Sridhar Samudrala wrote:
> Support a new 'passthru' mode with macvlan and 'mode' parameter with
> macvtap devices.
>
> Signed-off-by: Sridhar Samudrala s...@us.ibm.com

Can you split this into two patches? We definitely want the part adding
support for macvtap device mode setting now. The new passthru mode for
macvlan and macvtap probably needs some discussion and the patch in
iproute2 will depend on the kernel patch getting merged first.

I've added Stephen to the Cc list, he should also take a look.

	Arnd

> diff --git a/include/linux/if_link.h b/include/linux/if_link.h
> index f5bb2dc..23de79e 100644
> --- a/include/linux/if_link.h
> +++ b/include/linux/if_link.h
> @@ -230,6 +230,7 @@ enum macvlan_mode {
>  	MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
>  	MACVLAN_MODE_VEPA    = 2, /* talk to other ports through ext bridge */
>  	MACVLAN_MODE_BRIDGE  = 4, /* talk to bridge ports directly */
> +	MACVLAN_MODE_PASSTHRU = 8, /* take over the underlying device */
>  };
>
>  /* SR-IOV virtual function management section */
> diff --git a/ip/Makefile b/ip/Makefile
> index 2f223ca..6054e8a 100644
> --- a/ip/Makefile
> +++ b/ip/Makefile
> @@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
>      ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
>      ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
>      iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
> -    iplink_macvlan.o
> +    iplink_macvlan.o iplink_macvtap.o
>
>  RTMONOBJ=rtmon.o
> diff --git a/ip/iplink_macvlan.c b/ip/iplink_macvlan.c
> index a3c78bd..97787f9 100644
> --- a/ip/iplink_macvlan.c
> +++ b/ip/iplink_macvlan.c
> @@ -48,6 +48,8 @@ static int macvlan_parse_opt(struct link_util *lu, int argc, char **argv,
>  			mode = MACVLAN_MODE_VEPA;
>  		else if (strcmp(*argv, "bridge") == 0)
>  			mode = MACVLAN_MODE_BRIDGE;
> +		else if (strcmp(*argv, "passthru") == 0)
> +			mode = MACVLAN_MODE_PASSTHRU;
>  		else
>  			return mode_arg();
> @@ -82,6 +84,7 @@ static void macvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]
>  		    mode == MACVLAN_MODE_PRIVATE ? "private"
>  		    : mode == MACVLAN_MODE_VEPA ? "vepa"
>  		    : mode == MACVLAN_MODE_BRIDGE ? "bridge"
> +		    : mode == MACVLAN_MODE_PASSTHRU ? "passthru"
>  		    : "unknown");
>  }
> diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c
> new file mode 100644
> index 000..040cc68
> --- /dev/null
> +++ b/ip/iplink_macvtap.c
> @@ -0,0 +1,93 @@
> +/*
> + * iplink_macvtap.c	macvtap device support
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/socket.h>
> +#include <linux/if_link.h>
> +
> +#include "rt_names.h"
> +#include "utils.h"
> +#include "ip_common.h"
> +
> +static void explain(void)
> +{
> +	fprintf(stderr,
> +		"Usage: ... macvtap mode { private | vepa | bridge | passthru }\n"
> +	);
> +}
> +
> +static int mode_arg(void)
> +{
> +	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
> +		"\"vepa\" or \"bridge\" \"passthru\"\n");
> +	return -1;
> +}
> +
> +static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv,
> +			     struct nlmsghdr *n)
> +{
> +	while (argc > 0) {
> +		if (matches(*argv, "mode") == 0) {
> +			__u32 mode = 0;
> +			NEXT_ARG();
> +
> +			if (strcmp(*argv, "private") == 0)
> +				mode = MACVLAN_MODE_PRIVATE;
> +			else if (strcmp(*argv, "vepa") == 0)
> +				mode = MACVLAN_MODE_VEPA;
> +			else if (strcmp(*argv, "bridge") == 0)
> +				mode = MACVLAN_MODE_BRIDGE;
> +			else if (strcmp(*argv, "passthru") == 0)
> +				mode = MACVLAN_MODE_PASSTHRU;
> +			else
> +				return mode_arg();
> +
> +			addattr32(n, 1024, IFLA_MACVLAN_MODE, mode);
> +		} else if (matches(*argv, "help") == 0) {
> +			explain();
> +			return -1;
> +		} else {
> +			fprintf(stderr, "macvtap: what is \"%s\"?\n", *argv);
> +			explain();
> +			return -1;
> +		}
> +		argc--, argv++;
> +	}
> +
> +	return 0;
> +}
> +
> +static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
> +{
> +	__u32 mode;
> +
> +	if
Re: [RFC PATCH] macvlan: Introduce a PASSTHRU mode to takeover the underlying device
On Wednesday 27 October 2010, Sridhar Samudrala wrote:
> With the current default macvtap mode, a KVM guest using virtio with
> macvtap backend has the following limitations.
> - cannot change/add a mac address on the guest virtio-net
> - cannot create a vlan device on the guest virtio-net
> - cannot enable promiscuous mode on guest virtio-net
>
> This patch introduces a new mode called 'passthru' when creating a
> macvlan device which allows takeover of the underlying device and
> passing it to a guest using virtio with macvtap backend. Only one
> macvlan device is allowed in passthru mode and it inherits the mac
> address from the underlying device and sets it in promiscuous mode to
> receive and forward all the packets.

Interesting approach. It somewhat stretches the definition of the
macvlan concept, but it does sound useful to have.

I was thinking about adding a new tap frontend driver that could share
some code with macvtap and do only the takeover but not use macvlan as
a base. I believe that would be a cleaner abstraction, but your code
has two advantages in that the implementation is much simpler and that
it can share a fair amount of the infrastructure that we're putting
into qemu/libvirt/etc.

	Arnd

PS: Please add a Signed-off-by: line when sending a patch, even for
discussion.
Re: [PATCH iproute2] Support 'mode' parameter when creating macvtap device
On Friday 29 October 2010, Sridhar Samudrala wrote:
> Add support for 'mode' parameter when creating a macvtap device. This
> allows a macvtap device to be created in bridge, private or the
> default vepa modes.
>
> Signed-off-by: Sridhar Samudrala s...@us.ibm.com

Acked-by: Arnd Bergmann a...@arndb.de
Re: [PATCH] macvlan: Introduce 'passthru' mode to takeover the underlying device
On Friday 29 October 2010, Sridhar Samudrala wrote:
> With the current default 'vepa' mode, a KVM guest using virtio with
> macvtap backend has the following limitations.
> - cannot change/add a mac address on the guest virtio-net

I believe this could be changed if there is a need, but I actually
consider it one of the design points of macvlan that the guest is not
able to change the mac address. With 802.1Qbg you rely on the switch
being able to identify the guest by its MAC address, which the host
kernel must ensure.

> - cannot create a vlan device on the guest virtio-net

Why not? If this doesn't work, it's probably a bug! Why does the
passthru mode enable it if it doesn't work already?

> - cannot enable promiscuous mode on guest virtio-net

Could you elaborate why such a setup would be useful?

	Arnd
Re: [PATCH 07/10] UAPI: Put a comment into uapi/asm-generic/kvm_para.h and use it from arches
On Wednesday 17 October 2012, David Howells wrote:
> Make uapi/asm-generic/kvm_para.h non-empty by addition of a comment to
> stop the patch program from deleting it when it creates it. Then
> delete empty arch-specific uapi/asm/kvm_para.h files and tell the
> Kbuild files to use the generic instead.
>
> Should this perhaps instead be a #warning or #error that the facility
> is unsupported on this arch?

Just an empty file is fine by me, but an #error also sounds reasonable
if we want users to be able to write autoconf tests for it.

> Signed-off-by: David Howells dhowe...@redhat.com
> cc: Arnd Bergmann a...@arndb.de
> cc: Avi Kivity a...@redhat.com
> cc: Marcelo Tosatti mtosa...@redhat.com
> cc: kvm@vger.kernel.org

Acked-by: Arnd Bergmann a...@arndb.de
Re: [PATCH resend] compat_ioctl: fix warning caused by qemu
On Friday 01 July 2011, Johannes Stezenbach wrote:
> On Linux x86_64 host with 32bit userspace, running qemu or even just
>
> 	qemu-img create -f qcow2 some.img 1G
>
> causes a kernel warning:
>
> ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(5326){t:'S';sz:0} arg(7fff) on some.img
> ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img
>
> ioctl 5326 is CDROM_DRIVE_STATUS, ioctl 801c0204 is FDGETPRM.
>
> The warning appears because the Linux compat-ioctl handler for these
> ioctls only applies to block devices, while qemu also uses the ioctls
> on plain files.
>
> Signed-off-by: Johannes Stezenbach j...@sig21.net

Acked-by: Arnd Bergmann a...@arndb.de

> ---
> (resend with Cc: suggested by get_maintainer.pl)
> discussed in http://lkml.kernel.org/r/20110617090424.ga19...@sig21.net
>
> Arnd, is this what you had in mind, or did you mean to move all floppy
> compat definitions? I decided to go with the minimal change.
> Tested on both 2.6.39.2 and 3.0-rc5-63-g0d72c6f.

Yes, that should be fine, unless Jens would like to see a different
solution for the struct definitions, e.g. moving all of the floppy
compat ioctl numbers to fd.h. I'm fine with it either way.

	Arnd
Re: [KVM PATCH] KVM: introduce xinterface API for external interaction with guests
On Thursday 16 July 2009, Gregory Haskins wrote:
> Background: The original vbus code was tightly integrated with kvm.ko.
> Avi suggested that we abstract the interfaces such that it could live
> outside of kvm.

The code is still highly kvm-specific, you would not be able to use it
with another hypervisor like lguest or vmware player, right?

> Example usage: QEMU instantiates a guest, and an external module foo
> that desires the ability to interface with the guest (say via
> open(/dev/foo)). QEMU may then issue a KVM_GET_VMID operation to
> acquire the u64-based vmid, and pass it to
> ioctl(foofd, FOO_SET_VMID, vmid). Upon receipt, the foo module can
> issue kvm_xinterface_find(vmid) to acquire the proper context.
> Internally, the struct kvm* and associated struct module* will remain
> pinned at least until the foo module calls kvm_xinterface_put().

Your approach allows passing the vmid from a process that does not own
the kvm context. This looks like an intentional feature, but I can't
see what this gains us.

> As a final measure, we link the xinterface code statically into the
> kernel so that callers are guaranteed a stable interface to
> kvm_xinterface_find() without implicitly pinning kvm.ko or racing
> against it.

I also don't understand this. Are you worried about driver modules
breaking when an externally-compiled kvm.ko is loaded? The same could
be achieved by defining your data structures kvm_xinterface_ops and
kvm_xinterface in a kernel header that is not shipped by kvm-kmod but
always taken from the kernel headers. It does not matter if the entry
points are built into the kernel or exported from a kvm.ko as long as
you define a fixed ABI.

What is the problem with pinning kvm.ko from another module using its
features? Can't you simply provide a function call to lookup the kvm
context pointer from the file descriptor to achieve the same
functionality?

To take that thought further, maybe the dependency can be turned
around: If every user (pci-uio, virtio-net, ...) exposes a file
descriptor based interface to user space, you can have a kvm ioctl to
register the object behind that file descriptor with an existing kvm
context to associate it with a guest. That would nicely solve the life
time questions by pinning the external object for the life time of the
kvm context rather than the other way round, and it would be
completely separate from kvm in that each such object could be used by
other subsystems independent of kvm.

	Arnd
Re: [KVM PATCH] KVM: introduce xinterface API for external interaction with guests
On Thursday 16 July 2009, Gregory Haskins wrote:
> Arnd Bergmann wrote:
> > Your approach allows passing the vmid from a process that does not
> > own the kvm context. This looks like an intentional feature, but I
> > can't see what this gains us.
>
> This work is towards the implementation of lockless-shared-memory
> subsystems, which includes ring constructs such as virtio-ring,
> VJ-netchannels, and vbus-ioq. I find that these designs perform
> optimally when you allow two distinct contexts (producer + consumer)
> to process the ring concurrently, which implies a disparate context
> from the guest in question. Note that the infrastructure we are
> discussing does not impose a requirement for the contexts to be
> unique: it will work equally well from the same or a different
> process.
>
> For an example of this producer/consumer dynamic over shared memory
> in action, please refer to my previous posting re: vbus
> http://lkml.org/lkml/2009/4/21/408
>
> I am working on v4 now, and this patch is part of the required
> support.

Ok. I can see how your approach gives you more flexibility in this
regard, but it does not seem critical.

> But to your point, I suppose the dependency lifetime thing is not a
> huge deal. I could therefore modify the patch to simply link
> xinterface.o into kvm.ko and still achieve the primary objective by
> retaining ops->owner.

Right. And even if it's a separate module, holding an extra reference
on kvm.ko will not cause any harm.

> > Can't you simply provide a function call to lookup the kvm context
> > pointer from the file descriptor to achieve the same functionality?
>
> You mean so have:
>
> 	struct kvm_xinterface *kvm_xinterface_find(int fd)
>
> (instead of creating our own vmid namespace)? Or are you suggesting
> using fget() instead of kvm_xinterface_find()?

I guess they are roughly equivalent. Either you pass a fd to
kvm_xinterface_find, or pass the struct file pointer you get from
fget. The latter is probably more convenient because it allows you to
pass around the struct file in kernel contexts that don't have that
file descriptor open.

> > To take that thought further, maybe the dependency can be turned
> > around: If every user (pci-uio, virtio-net, ...) exposes a file
> > descriptor based interface to user space, you can have a kvm ioctl
> > to register the object behind that file descriptor with an existing
> > kvm context to associate it with a guest.
>
> FWIW: We do that already for the signaling path (see irqfd and
> ioeventfd in kvm.git). Each side exposes interfaces that accept
> eventfds, and the fds are passed around that way. However, for the
> functions we are talking about now, I don't think it really works
> well to go the other way. I could be misunderstanding what you mean,
> though.
>
> What I mean is that it's KVM that is providing a service to the other
> modules (in this case, translating memory pointers), so what would an
> inverse interface look like for that? And even if you came up with
> one, it seems to me that it's just 6 of one, half-dozen of the other
> kind of thing.

I mean something like

int kvm_ioctl_register_service(struct file *filp, unsigned long arg)
{
	struct file *service = fget(arg);
	struct kvm *kvm = filp->private_data;

	if (!service->f_ops->new_xinterface_register)
		return -EINVAL;

	return service->f_ops->new_xinterface_register(service, (void *)kvm,
						&kvm_xinterface_ops);
}

This would assume that we define a new file_operation specifically for
this, which would simplify the code, but there are other ways to
achieve the same. It would even mean that you don't need any static
code as an interface layer.

	Arnd
Re: [KVM_AUTOTEST] set English environment
On Thursday 09 July 2009, Lukáš Doktor wrote:
> --- orig/client/tests/kvm/control	2009-07-08 13:18:07.0 +0200
> +++ new/client/tests/kvm/control	2009-07-09 12:32:32.0 +0200
> @@ -45,6 +45,8 @@ Each test is appropriately documented on
>
>  import sys, os
>
> +# set English environment
> +os.environ['LANG'] = 'en_US.UTF-8'
> +
>  # enable modules import from current directory (tests/kvm)
>  pwd = os.path.join(os.environ['AUTODIR'],'tests/kvm')
>  sys.path.append(pwd)

LANG can still be overridden with LC_ALL. For a well-defined
environment, best set LC_ALL='C'. This will also set other i18n
settings and works on systems that don't come with UTF-8 enabled.

	Arnd
Re: [PATCH 0/7] AlacrityVM guest drivers
On Thursday 06 August 2009, Gregory Haskins wrote:
> We can exchange out the virtio-pci module like this:
>
> (guest-side)
> |--------------------------
> | virtio-net
> |--------------------------
> | virtio-ring
> |--------------------------
> | virtio-bus
> |--------------------------
> | virtio-vbus
> |--------------------------
> | vbus-proxy
> |--------------------------
> | vbus-connector
> |--------------------------
>             |
>          (vbus)
>             |
> |--------------------------
> | kvm.ko
> |--------------------------
> | vbus-connector
> |--------------------------
> | vbus
> |--------------------------
> | virtio-net-tap (vbus model)
> |--------------------------
> | netif
> |--------------------------
> (host-side)
>
> So virtio-net runs unmodified. What is competing here is virtio-pci
> vs. virtio-vbus. Also, venet vs. virtio-net are technically
> competing. But to say "virtio vs. vbus" is inaccurate, IMO.

I think what's confusing everyone is that you are competing on
multiple issues:

1. Implementation of bus probing: both vbus and virtio are backed by
   PCI devices and can be backed by something else (e.g. virtio by
   lguest or even by vbus).

2. Exchange of metadata: virtio uses a config space, vbus uses devcall
   to do the same.

3. User data transport: virtio has virtqueues, vbus has shm/ioq.

I think these three are the main differences, and the venet vs.
virtio-net question comes down to which interface the drivers use for
each aspect. Do you agree with this interpretation?

Now to draw conclusions from each of these is of course highly
subjective, but this is how I view it:

1. The bus probing is roughly equivalent, they both work and the
   virtio method seems to need a little less code but that could be
   fixed by slimming down the vbus code as I mentioned in my comments
   on the pci-to-vbus bridge code. However, I would much prefer not to
   have both of them, and virtio came first.

2. The two methods (devcall/config space) are more or less equivalent
   and you should be able to implement each one through the other one.
   The virtio design was driven by making it look similar to PCI, the
   vbus design was driven by making it easy to implement in a host
   kernel. I don't care too much about these, as they can probably
   coexist without causing any trouble. For a (hypothetical)
   vbus-in-virtio device, a devcall can be a config-set/config-get
   pair, for a virtio-in-vbus, you can do a config-get and a
   config-set devcall and be happy. Each could be done in a trivial
   helper library.

3. The ioq method seems to be the real core of your work that makes
   venet perform better than virtio-net with its virtqueues. I don't
   see any reason to doubt that your claim is correct. My conclusion
   from this would be to add support for ioq to virtio devices,
   alongside virtqueues, but to leave out the extra bus_type and
   probing method.

	Arnd
Re: [PATCH 1/1] net: fix vnet_hdr bustage with slirp
On Friday 07 August 2009, Mark McLoughlin wrote:
> slirp has started using VLANClientState::opaque and this has caused
> the kvm specific tap_has_vnet_hdr() hack to break because we blindly
> use this opaque pointer even if it is not a tap client.
>
> Add yet another hack to check that we're actually getting called with
> a tap client.
>
> [Needed on stable-0.11 too]
>
> Signed-off-by: Mark McLoughlin mar...@redhat.com

Jens and I discovered the same bug before, but then we forgot about
sending a fix (sorry). Your patch should work fine as a workaround,
but I wonder if it is the right solution.

The abstraction of struct VLANClientState is otherwise done through
function pointers taking the VLANClientState pointer as their first
argument. IMHO a cleaner abstraction would be to do the same for
tap_has_vnet_hdr(), like the patch below, and similar for other
functions passing 'opaque' pointers.

Signed-off-by: Arnd Bergmann a...@arndb.de

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 6dfe758..6b34e82 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -123,7 +123,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
     VirtIONet *n = to_virtio_net(vdev);
     VLANClientState *host = n->vc->vlan->first_client;
 
-    if (tap_has_vnet_hdr(host)) {
+    if (host->has_vnet_hdr && host->has_vnet_hdr(host)) {
         tap_using_vnet_hdr(host, 1);
         features |= (1 << VIRTIO_NET_F_CSUM);
         features |= (1 << VIRTIO_NET_F_GUEST_CSUM);
@@ -166,7 +166,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
     n->mergeable_rx_bufs = !!(features & (1 << VIRTIO_NET_F_MRG_RXBUF));
 
 #ifdef TAP_VNET_HDR
-    if (!tap_has_vnet_hdr(host) || !host->set_offload)
+    if (!(host->has_vnet_hdr && host->has_vnet_hdr(host)) || !host->set_offload)
         return;
 
     host->set_offload(host,
@@ -398,7 +398,7 @@ static int receive_header(VirtIONet *n, struct iovec *iov, int iovcnt,
     hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
 
 #ifdef TAP_VNET_HDR
-    if (tap_has_vnet_hdr(n->vc->vlan->first_client)) {
+    if ((host->has_vnet_hdr && host->has_vnet_hdr(n->vc->vlan->first_client))) {
         memcpy(hdr, buf, sizeof(*hdr));
         offset = sizeof(*hdr);
         work_around_broken_dhclient(hdr, buf + offset, size - offset);
@@ -425,7 +425,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
         return 1;
 
 #ifdef TAP_VNET_HDR
-    if (tap_has_vnet_hdr(n->vc->vlan->first_client))
+    if ((host->has_vnet_hdr &&
+         host->has_vnet_hdr(n->vc->vlan->first_client)))
         ptr += sizeof(struct virtio_net_hdr);
 #endif
@@ -529,7 +530,8 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
     VirtQueueElement elem;
 #ifdef TAP_VNET_HDR
-    int has_vnet_hdr = tap_has_vnet_hdr(n->vc->vlan->first_client);
+    int has_vnet_hdr = (host->has_vnet_hdr &&
+                        host->has_vnet_hdr(n->vc->vlan->first_client));
 #else
     int has_vnet_hdr = 0;
 #endif
@@ -620,7 +622,7 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
     qemu_put_buffer(f, (uint8_t *)n->vlans, MAX_VLAN >> 3);
 
 #ifdef TAP_VNET_HDR
-    qemu_put_be32(f, tap_has_vnet_hdr(n->vc->vlan->first_client));
+    qemu_put_be32(f, (host->has_vnet_hdr && host->has_vnet_hdr(n->vc->vlan->first_client)));
 #else
     qemu_put_be32(f, 0);
 #endif
diff --git a/net.c b/net.c
index 931def1..b56ae78 100644
--- a/net.c
+++ b/net.c
@@ -754,7 +754,7 @@ static void vmchannel_read(void *opaque, const uint8_t *buf, int size)
 
 #ifdef _WIN32
 
-int tap_has_vnet_hdr(void *opaque)
+static int tap_has_vnet_hdr(struct VLANClientState *vc)
 {
     return 0;
 }
@@ -906,9 +906,8 @@ static void tap_send(void *opaque)
     } while (s->size > 0);
 }
 
-int tap_has_vnet_hdr(void *opaque)
+static int tap_has_vnet_hdr(struct VLANClientState *vc)
 {
-    VLANClientState *vc = opaque;
     TAPState *s = vc->opaque;
 
     return s ? s->has_vnet_hdr : 0;
@@ -991,6 +990,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan,
     s->has_vnet_hdr = vnet_hdr != 0;
     s->vc = qemu_new_vlan_client(vlan, model, name, tap_receive,
                                  NULL, tap_cleanup, s);
+    s->vc->has_vnet_hdr = tap_has_vnet_hdr;
     s->vc->fd_readv = tap_receive_iov;
 #ifdef TUNSETOFFLOAD
     s->vc->set_offload = tap_set_offload;
diff --git a/net.h b/net.h
index bc42428..7c79734 100644
--- a/net.h
+++ b/net.h
@@ -21,6 +21,7 @@ struct VLANClientState {
     IOCanRWHandler *fd_can_read;
     NetCleanup *cleanup;
     LinkStatusChanged *link_status_changed;
+    int (*has_vnet_hdr)(struct VLANClientState *);
     int link_down;
     SetOffload *set_offload;
     void *opaque;
@@ -72,7 +73,6 @@ void qemu_handler_true(void *opaque);
 void do_info_network(Monitor *mon);
 int do_set_link(Monitor *mon, const char *name, const char *up_or_down);
 
-int tap_has_vnet_hdr(void *opaque);
 void tap_using_vnet_hdr(void *opaque, int using_vnet_hdr);
 
 /* NIC info */
Re: [PATCH 1/1] net: fix vnet_hdr bustage with slirp
On Friday 07 August 2009, Mark McLoughlin wrote:
> The vnet_hdr code in qemu-kvm.git is a hack which we plan to
> (eventually) replace by allowing a nic to be paired directly with a
> backend. Your patch is fine, but I'd suggest since both are a hack we
> stick with mine since it'll reduce merge conflicts. Both hacks will
> go away eventually, anyway.

Ok, sounds good.

Thanks,

	Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Monday 10 August 2009, Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to
> reduce the number of system calls involved in virtio networking.
> Existing virtio net code is used in the guest without modification.

Very nice, I loved reading it. It's getting rather late in my time
zone, so this comments only on the network driver. I'll go through the
rest tomorrow.

> @@ -293,6 +293,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>  		err = PTR_ERR(vblk->vq);
>  		goto out_free_vblk;
>  	}
> +	printk(KERN_ERR "vblk->vq = %p\n", vblk->vq);
>
>  	vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
>  	if (!vblk->pool) {
> @@ -383,6 +384,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>  	if (!err)
>  		blk_queue_logical_block_size(vblk->disk->queue, blk_size);
>
> +	printk(KERN_ERR "virtio_config_val returned %d\n", err);
> +
>  	add_disk(vblk->disk);
>  	return 0;

I guess you meant to remove these before submitting.

> +static void handle_tx_kick(struct work_struct *work);
> +static void handle_rx_kick(struct work_struct *work);
> +static void handle_tx_net(struct work_struct *work);
> +static void handle_rx_net(struct work_struct *work);

[style] I think the code gets more readable if you reorder it so that
you don't need forward declarations for static functions.

> +static long vhost_net_reset_owner(struct vhost_net *n)
> +{
> +	struct socket *sock = NULL;
> +	long r;
> +
> +	mutex_lock(&n->dev.mutex);
> +	r = vhost_dev_check_owner(&n->dev);
> +	if (r)
> +		goto done;
> +	sock = vhost_net_stop(n);
> +	r = vhost_dev_reset_owner(&n->dev);
> +done:
> +	mutex_unlock(&n->dev.mutex);
> +	if (sock)
> +		fput(sock->file);
> +	return r;
> +}

What is the difference between vhost_net_reset_owner(n) and
vhost_net_set_socket(n, -1)?

> +static struct file_operations vhost_net_fops = {
> +	.owner          = THIS_MODULE,
> +	.release        = vhost_net_release,
> +	.unlocked_ioctl = vhost_net_ioctl,
> +	.open           = vhost_net_open,
> +};

This is missing a compat_ioctl pointer. It should simply be

static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
				   unsigned long arg)
{
	return vhost_net_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
}

> +/* Bits from fs/aio.c. TODO: export and use from there? */
> +/*
> + * use_mm
> + *	Makes the calling kernel thread take on the specified
> + *	mm context.
> + *	Called by the retry thread execute retries within the
> + *	iocb issuer's mm context, so that copy_from/to_user
> + *	operations work seamlessly for aio.
> + *	(Note: this routine is intended to be called only
> + *	from a kernel thread context)
> + */
> +static void use_mm(struct mm_struct *mm)
> +{
> +	struct mm_struct *active_mm;
> +	struct task_struct *tsk = current;
> +
> +	task_lock(tsk);
> +	active_mm = tsk->active_mm;
> +	atomic_inc(&mm->mm_count);
> +	tsk->mm = mm;
> +	tsk->active_mm = mm;
> +	switch_mm(active_mm, mm, tsk);
> +	task_unlock(tsk);
> +
> +	mmdrop(active_mm);
> +}

Why do you need a kernel thread here? If the data transfer functions
all get called from a guest intercept, shouldn't you already be in the
right mm?

> +static void handle_tx(struct vhost_net *net)
> +{
> +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> +	unsigned head, out, in;
> +	struct msghdr msg = {
> +		.msg_name = NULL,
> +		.msg_namelen = 0,
> +		.msg_control = NULL,
> +		.msg_controllen = 0,
> +		.msg_iov = (struct iovec *)vq->iov + 1,
> +		.msg_flags = MSG_DONTWAIT,
> +	};
> +	size_t len;
> +	int err;
> +	struct socket *sock = rcu_dereference(net->sock);
> +	if (!sock || !sock_writeable(sock->sk))
> +		return;
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	for (;;) {
> +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
> +		if (head == vq->num)
> +			break;
> +		if (out <= 1 || in) {
> +			vq_err(vq, "Unexpected descriptor format for TX: "
> +			       "out %d, int %d\n", out, in);
> +			break;
> +		}
> +		/* Sanity check */
> +		if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
> +			vq_err(vq, "Unexpected header len for TX: "
> +			       "%ld expected %zd\n", vq->iov->iov_len,
> +			       sizeof(struct virtio_net_hdr));
> +			break;
> +		}
> +		/* Skip header. TODO: support TSO. */
> +		msg.msg_iovlen = out - 1;
> +		len = iov_length(vq->iov + 1, out - 1);
> +		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> +		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> +		if (err < 0) {
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Monday 10 August 2009 20:10:44 Michael S. Tsirkin wrote: On Mon, Aug 10, 2009 at 09:51:18PM +0200, Arnd Bergmann wrote: what is the difference between vhost_net_reset_owner(n) and vhost_net_set_socket(n, -1)? set socket to -1 will only stop the device. reset owner will let another process take over the device. It also needs to reset all parameters to make it safe for that other process, so in particular the device is stopped. ok I tried explaining this in the header vhost.h - does the comment there help, or do I need to clarify it? No, I just didn't get there yet.

I had the impression that if there's no compat_ioctl, unlocked_ioctl will get called automatically. No? It will issue a kernel warning but not call unlocked_ioctl, so you need either a compat_ioctl method or to list the numbers in fs/compat_ioctl.c, which I try to avoid.

Why do you need a kernel thread here? If the data transfer functions all get called from a guest intercept, shouldn't you already be in the right mm? several reasons :)

 - I get called under lock, so can't block
 - eventfd can be passed to another process, and I won't be in guest context at all
 - this also gets called outside guest context from socket poll
 - vcpu is blocked while it's doing i/o; it is better to free it up as all the packet copying might take a while

Ok. I guess that this is where one could plug into macvlan directly, using sock_alloc_send_skb/memcpy_fromiovec/dev_queue_xmit, instead of filling a msghdr for each, if we want to combine this with the work I did on that. quite possibly. Or one can just bind a raw socket to macvlan :) Right, that works as well, but may get more complicated once we try to add zero-copy or other optimizations.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2 0/2] vhost: a kernel-level virtio server
On Wednesday 12 August 2009, Gregory Haskins wrote: Are you saying SRIOV is a requirement, and I can either program the SRIOV adapter with a mac or use promisc? Or are you saying I can use SRIOV+programmed mac OR a regular nic + promisc (with a perf penalty). SRIOV is not a requirement. And you can also use a dedicated nic+programmed mac if you are so inclined. Makes sense. Got it.

I was going to add guest-to-guest to the test matrix, but I assume that is not supported with vhost unless you have something like a VEPA enabled bridge? If I understand it correctly, you can at least connect a veth pair to a bridge, right? Something like

               veth0 <-> veth1 <-> vhost <-> guest 1
 eth0 <-> br0 --|
               veth2 <-> veth3 <-> vhost <-> guest 2

It's a bit more complicated than it needs to be, but should work fine.

Arnd
Re: [PATCHv2 0/2] vhost: a kernel-level virtio server
On Wednesday 12 August 2009, Michael S. Tsirkin wrote: If I understand it correctly, you can at least connect a veth pair to a bridge, right? Something like

               veth0 <-> veth1 <-> vhost <-> guest 1
 eth0 <-> br0 --|
               veth2 <-> veth3 <-> vhost <-> guest 2

Heh, you don't need a bridge in this picture:

 guest 1 <-> vhost <-> veth0 <-> veth1 <-> vhost <-> guest 2

Sure, but the setup I described is the one that I would expect to see in practice because it gives you external connectivity. Measuring two guests communicating over a veth pair is interesting for finding the bottlenecks, but of little practical relevance.

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Monday 10 August 2009, Michael S. Tsirkin wrote:

+struct workqueue_struct *vhost_workqueue;

[nitpicking] This could be static.

+/* The virtqueue structure describes a queue attached to a device. */
+struct vhost_virtqueue {
+	struct vhost_dev *dev;
+
+	/* The actual ring of buffers. */
+	struct mutex mutex;
+	unsigned int num;
+	struct vring_desc __user *desc;
+	struct vring_avail __user *avail;
+	struct vring_used __user *used;
+	struct file *kick;
+	struct file *call;
+	struct file *error;
+	struct eventfd_ctx *call_ctx;
+	struct eventfd_ctx *error_ctx;
+
+	struct vhost_poll poll;
+
+	/* The routine to call when the Guest pings us, or timeout. */
+	work_func_t handle_kick;
+
+	/* Last available index we saw. */
+	u16 last_avail_idx;
+
+	/* Last index we used. */
+	u16 last_used_idx;
+
+	/* Outstanding buffers */
+	unsigned int inflight;
+
+	/* Is this blocked? */
+	bool blocked;
+
+	struct iovec iov[VHOST_NET_MAX_SG];
+
+} ____cacheline_aligned;

We discussed this before, and I still think this could be directly derived from struct virtqueue, in the same way that vring_virtqueue is derived from struct virtqueue. That would make it possible for simple device drivers to use the same driver in both host and guest, similar to how Ira Snyder used virtqueues to make virtio_net run between two hosts running the same code [1]. Ideally, I guess you should be able to even make virtio_net work in the host if you do that, but that could bring other complexities.

Arnd

[1] http://lkml.org/lkml/2009/2/23/353
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wednesday 12 August 2009, Michael S. Tsirkin wrote: On Wed, Aug 12, 2009 at 07:03:22PM +0200, Arnd Bergmann wrote: We discussed this before, and I still think this could be directly derived from struct virtqueue, in the same way that vring_virtqueue is derived from struct virtqueue. I prefer keeping it simple. Much of the abstraction in virtio is due to the fact that it needs to work on top of different hardware emulations: lguest, kvm, possibly others in the future. vhost is always working on real hardware, using eventfd as the interface, so it does not need that.

Well, that was my point: virtio can already work on a number of abstractions, so adding one more for vhost should not be too hard.

That would make it possible for simple device drivers to use the same driver in both host and guest. I don't think so. For example, there's a callback field that gets invoked in the guest when buffers are consumed. It could be overloaded to mean "buffers are available" in the host, but you never handle both situations in the same way, so what's the point? ... As I pointed out earlier, most code in virtio net is asymmetrical: the guest provides buffers, the host consumes them. Possibly, one could use virtio rings in a symmetrical way, but support of existing guest virtio net means there's almost no shared code.

The trick is to swap the virtqueues instead. virtio-net is actually mostly symmetric in just the same way that the physical wires on a twisted pair ethernet are symmetric (I like how that analogy fits). virtio_net kicks the transmit virtqueue when it has data and it kicks the receive queue when it has empty buffers to fill, and it has callbacks when the two are done. You can do the same in both the guest and the host, but then the guest's input virtqueue is the host's output virtqueue and vice versa.

Once a virtqueue got kicked from both sides, the vhost_virtqueue implementation between the two only needs to do a copy_from_user or copy_to_user (possibly from a thread if it is in atomic context) and then call the two callback functions. This is basically the same thing you do already, except that you use slightly different names for the components.

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wednesday 12 August 2009, Anthony Liguori wrote: At any rate, I'd like to see performance results before we consider trying to reuse virtio code. Yes, I agree. I'd also like to do more work on the macvlan extensions to see if it works out without involving a socket. Passing the socket into the vhost_net device is a nice feature of the current implementation that we'd have to give up for something else (e.g. making the vhost a real network interface that you can hook up to a bridge) if it were to use virtio. Unless I can come up with a solution that is clearly superior, I'm taking back my objections on that part for now. Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Arnd Bergmann wrote: Unfortunately, this also implies that you could no longer simply use the packet socket interface as you do currently, as I realized only now. This obviously has a significant impact on your user space interface. Also, if we do the copy in the transport, it definitely means that we can't get to zero-copy RX/TX from guest space any more. The current vhost_net driver doesn't do that yet, but could be extended in the same way that I'm hoping to do it for macvtap. Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote: On Wed, Aug 12, 2009 at 07:59:47PM +0200, Arnd Bergmann wrote: The trick is to swap the virtqueues instead. virtio-net is actually mostly symmetric in just the same way that the physical wires on a twisted pair ethernet are symmetric (I like how that analogy fits). You need to really squint hard for it to look symmetric. For example, for RX, virtio allocates an skb, puts a descriptor on a ring and waits for the host to fill it in. The host system can not do the same: the guest does not have access to host memory. You can do a copy in the transport to hide this fact, but it will kill performance.

Yes, that is what I was suggesting all along. The actual copy operation has to be done by the host transport, which is obviously different from the guest transport that just calls the host using vring_kick(). Right now, the number of copy operations in your code is the same. You are doing the copy a little bit later in skb_copy_datagram_iovec(), which is indeed a very nice hack. Changing to a virtqueue based method would imply that the host needs to add each skb_frag_t to its outbound virtqueue, which then gets copied into the guest's inbound virtqueue. Unfortunately, this also implies that you could no longer simply use the packet socket interface as you do currently, as I realized only now. This obviously has a significant impact on your user space interface.

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote: The best way to do this IMO would be to add zero copy support to raw sockets, vhost will then get it basically for free.

Yes, that would be nice. I wonder if that could lead to security problems on TX though. I guess it will only bring significant performance improvements if we leave the data writable in user space or guest memory during the operation. If the user finds the right timing, it could modify the frame headers after they have been checked using netfilter, or while the frames are being consumed in the kernel (e.g. an NFS server running in a guest).

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote: Right now, the number of copy operations in your code is the same. You are doing the copy a little bit later in skb_copy_datagram_iovec(), which is indeed a very nice hack. Changing to a virtqueue based method would imply that the host needs to add each skb_frag_t to its outbound virtqueue, which then gets copied into the guest's inbound virtqueue. Which is a lot more code than just calling skb_copy_datagram_iovec.

Well, I don't see this part as much of a problem, because the code already exists in virtio_net. If we really wanted to go down that road, just using virtio_net would solve the problem of frame handling entirely, but create new problems elsewhere, as we have mentioned.

Arnd
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote: What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification.

AFAICT, you have addressed all my comments, mostly by convincing me that you got it right anyway ;-). I hope this gets into 2.6.32, good work!

Signed-off-by: Michael S. Tsirkin m...@redhat.com

Acked-by: Arnd Bergmann a...@arndb.de

One idea though:

+	/* Parameter checking */
+	if (sock->sk->sk_type != SOCK_RAW) {
+		r = -ESOCKTNOSUPPORT;
+		goto done;
+	}
+
+	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
+			       &uaddr_len, 0);
+	if (r)
+		goto done;
+
+	if (uaddr.sa.sll_family != AF_PACKET) {
+		r = -EPFNOSUPPORT;
+		goto done;
+	}

You currently limit the scope of the driver by only allowing raw packet sockets to be passed into the network driver. In qemu, we currently support some very similar transports:

 * raw packet (not in a release yet)
 * tcp connection
 * UDP multicast
 * tap character device
 * VDE with Unix local sockets

My primary interest right now is the tap support, but I think it would be interesting in general to allow different file descriptor types in vhost_net_set_socket. AFAICT, there are two major differences that we need to handle for this:

 * Most of the transports are sockets, tap uses a character device. This could be dealt with by having both a struct socket * in struct vhost_net *and* a struct file *, or by always keeping the struct file and calling vfs_readv/vfs_writev for the data transport in both cases.

 * Each transport has a slightly different header; we have

   - raw ethernet frames (raw, udp multicast, tap)
   - 32-bit length + raw frames, possibly fragmented (tcp)
   - 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)

   To handle these three cases, we need either different ioctl numbers so that vhost_net can choose the right one, or a flags field in VHOST_NET_SET_SOCKET, like

	#define VHOST_NET_RAW		1
	#define VHOST_NET_LEN_HDR	2
	#define VHOST_NET_VNET_HDR	4

	struct vhost_net_socket {
		unsigned int flags;
		int fd;
	};

	#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)

If both of those are addressed, we can treat vhost_net as a generic way to do network handling in the kernel independent of the qemu model (raw, tap, ...) for it. Your qemu patch would have to work differently, so instead of

	qemu -net nic,vhost=eth0

you would do the same as today with the raw packet socket extension

	qemu -net nic -net raw,ifname=eth0

Qemu could then automatically try to use vhost_net, if it's available in the kernel, or just fall back on software vlan otherwise. Does that make sense?

Arnd
Re: [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Tuesday 18 August 2009, Gregory Haskins wrote: Avi Kivity wrote: On 08/17/2009 10:33 PM, Gregory Haskins wrote: One point of contention is that this is all managementy stuff and should be kept out of the host kernel. Exposing shared memory, interrupts, and guest hypercalls can all be easily done from userspace (as virtio demonstrates). True, some devices need kernel acceleration, but that's no reason to put everything into the host kernel.

See my last reply to Anthony. My two points here are that:

 a) having it in-kernel makes it a complete subsystem, which perhaps has diminished value in kvm, but adds value in most other places that we are looking to use vbus.

 b) the in-kernel code is being overstated as complex. We are not talking about your typical virt thing, like an emulated ICH/PCI chipset. It's really a simple list of devices with a handful of attributes. They are managed using established linux interfaces, like sysfs/configfs.

IMHO the complexity of the code is not so much of a problem. What I see as a problem is the complexity of a kernel/user space interface that manages the devices with global state. One of the greatest features of Michael's vhost driver is that all the state is associated with open file descriptors that either exist already or belong to the vhost_net misc device. When a process dies, all the file descriptors get closed and the whole state is cleaned up implicitly. AFAICT, you can't do that with the vbus host model.

What performance oriented items have been left unaddressed? Well, the interrupt model to name one. The performance aspects of your interrupt model are independent of the vbus proxy, or at least they should be. Let's assume for now that your event notification mechanism gives significant performance improvements (which we can't measure independently right now). I don't see a reason why we could not get the same performance out of a paravirtual interrupt controller that uses the same method, and it would be straightforward to implement one and use that together with all the existing emulated PCI devices and virtio devices including vhost_net.

Arnd
Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Tuesday 18 August 2009 20:35:22 Michael S. Tsirkin wrote: On Tue, Aug 18, 2009 at 10:27:52AM -0700, Ira W. Snyder wrote: Also, in my case I'd like to boot Linux with my rootfs over NFS. Is vhost-net capable of this? I've had Arnd, BenH, and Grant Likely (and others, privately) contact me about devices they are working with that would benefit from something like virtio-over-PCI. I'd like to see vhost-net be merged with the capability to support my use case. There are plenty of others that would benefit, not just myself. yes. I'm not sure vhost-net is being written with this kind of future use in mind. I'd hate to see it get merged, and then have to change the ABI to support physical-device-to-device usage. It would be better to keep future use in mind now, rather than try to hack it in later. I still need to think your usage over. I am not so sure this fits what vhost is trying to do. If not, possibly it's better to just have a separate driver for your device.

I now think we need both. virtio-over-PCI does it the right way for its purpose and can be rather generic. It could certainly be extended to support virtio-net on both sides (host and guest) of KVM, but I think it better fits the use where a kernel wants to communicate with some other machine where you normally wouldn't think of using qemu. Vhost-net OTOH is great in the way that it serves as an easy way to move the virtio-net code from qemu into the kernel, without changing its behaviour. It should even be straightforward to do live-migration between hosts with and without it, something that would be much harder with the virtio-over-PCI logic. Also, its internal state is local to the process owning its file descriptor, which makes it much easier to manage permissions and cleanup of its resources.

Arnd