Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Friday 09 April 2010, xiaohui@intel.com wrote:
> From: Xin Xiaohui <xiaohui@intel.com>
>
> Add a device to utilize the vhost-net backend driver for copy-less
> data transfer between guest FE and host NIC. It pins the guest user
> space to the host memory and provides proto_ops as sendmsg/recvmsg
> to vhost-net.

Sorry for taking so long before finding the time to look at your code
in more detail.

It seems that you are duplicating a lot of functionality that is
already in macvtap. I've asked about this before, but then didn't look
at your newer versions. Can you explain the value of introducing
another interface to user land?

I'm still planning to add zero-copy support to macvtap, hopefully
reusing parts of your code, but do you think there is value in having
both?

> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> new file mode 100644
> index 000..86d2525
> --- /dev/null
> +++ b/drivers/vhost/mpassthru.c
> @@ -0,0 +1,1264 @@
> +
> +#ifdef MPASSTHRU_DEBUG
> +static int debug;
> +
> +#define DBG  if (mp->debug) printk
> +#define DBG1 if (debug == 2) printk
> +#else
> +#define DBG(a...)
> +#define DBG1(a...)
> +#endif

This should probably just use the existing dev_dbg/pr_debug
infrastructure.

[... skipping buffer management code for now]

> +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
> +                      struct msghdr *m, size_t total_len)
> +{
[...]

This function looks like we should be able to easily include it into
macvtap and get zero-copy transmits without introducing the new
user-level interface.

> +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
> +                      struct msghdr *m, size_t total_len,
> +                      int flags)
> +{
> +        struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +        struct page_ctor *ctor;
> +        struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);

It smells like a layering violation to look at the iocb->private field
from a lower-level driver.
I would have hoped that it's possible to implement this without having
this driver know about the higher-level vhost driver internals. Can you
explain why this is needed?

> +        spin_lock_irqsave(&ctor->read_lock, flag);
> +        list_add_tail(&info->list, &ctor->readq);
> +        spin_unlock_irqrestore(&ctor->read_lock, flag);
> +
> +        if (!vq->receiver) {
> +                vq->receiver = mp_recvmsg_notify;
> +                set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
> +                                   vq->num * 4096,
> +                                   vq->num * 4096);
> +        }
> +
> +        return 0;
> +}

Not sure what I'm missing, but who calls the vq->receiver? This seems
to be neither in the upstream version of vhost nor introduced by your
patch.

> +static void __mp_detach(struct mp_struct *mp)
> +{
> +        mp->mfile = NULL;
> +
> +        mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
> +        page_ctor_detach(mp);
> +        mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> +
> +        /* Drop the extra count on the net device */
> +        dev_put(mp->dev);
> +}
> +
> +static DEFINE_MUTEX(mp_mutex);
> +
> +static void mp_detach(struct mp_struct *mp)
> +{
> +        mutex_lock(&mp_mutex);
> +        __mp_detach(mp);
> +        mutex_unlock(&mp_mutex);
> +}
> +
> +static void mp_put(struct mp_file *mfile)
> +{
> +        if (atomic_dec_and_test(&mfile->count))
> +                mp_detach(mfile->mp);
> +}
> +
> +static int mp_release(struct socket *sock)
> +{
> +        struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +        struct mp_file *mfile = mp->mfile;
> +
> +        mp_put(mfile);
> +        sock_put(mp->socket.sk);
> +        put_net(mfile->net);
> +
> +        return 0;
> +}

Doesn't this prevent the underlying interface from going away while the
chardev is open? You also have logic to handle that case, so why do you
keep the extra reference on the netdev?

> +/* Ops structure to mimic raw sockets with mp device */
> +static const struct proto_ops mp_socket_ops = {
> +        .sendmsg = mp_sendmsg,
> +        .recvmsg = mp_recvmsg,
> +        .release = mp_release,
> +};

> +static int mp_chr_open(struct inode *inode, struct file * file)
> +{
> +        struct mp_file *mfile;
> +        cycle_kernel_lock();

I don't think you really want to use the BKL here, just kill that line.
> +static long mp_chr_ioctl(struct file *file, unsigned int cmd,
> +                         unsigned long arg)
> +{
> +        struct mp_file *mfile = file->private_data;
> +        struct mp_struct *mp;
> +        struct net_device *dev;
> +        void __user* argp = (void __user *)arg;
> +        struct ifreq ifr;
> +        struct sock *sk;
> +        int ret;
> +
> +        ret = -EINVAL;
> +
> +        switch (cmd) {
> +        case MPASSTHRU_BINDDEV:
> +                ret = -EFAULT;
> +                if (copy_from_user(&ifr, argp, sizeof ifr))
> +                        break;

This is broken for 32-bit compat mode ioctls, because struct ifreq is
different between 32 and 64 bit systems. Since you are only using the
device name anyway, a fixed-length string or just the interface index
would be simpler and work better.
Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 04:55:21PM +0200, Arnd Bergmann wrote:
> > On Friday 09 April 2010, xiaohui@intel.com wrote:
> > > From: Xin Xiaohui <xiaohui@intel.com>
> >
> > It seems that you are duplicating a lot of functionality that is
> > already in macvtap. I've asked about this before but then didn't
> > look at your newer versions. Can you explain the value of
> > introducing another interface to user land?
>
> Hmm, I have not noticed a lot of duplication.

The code is indeed quite distinct, but the idea of adding another
character device to pass into vhost for direct device access is.

> BTW macvtap also duplicates tun code, it might be a good idea for tun
> to export some functionality.

Yes, that's something I plan to look into.

> > I'm still planning to add zero-copy support to macvtap, hopefully
> > reusing parts of your code, but do you think there is value in
> > having both?
>
> If macvtap would get zero copy tx and rx, maybe not.

It's not immediately obvious whether zero-copy support for macvtap can
work, though, especially zero-copy rx. The approach with mpassthru is
much simpler in that it takes complete control of the device.

As far as I can tell, the most significant limitation of mpassthru is
that there can only ever be a single guest on a physical NIC. Given
that limitation, I believe we can do the same on macvtap, and simply
disable zero-copy rx when you want to use more than one guest, or both
guest and host on the same NIC.

The logical next step here would be to allow VMDq and similar
technologies to separate out the rx traffic in the hardware. We don't
have a configuration interface for that yet, but since this is
logically the same as macvlan, I think we should use the same
interfaces for both, essentially treating VMDq as a hardware
acceleration for macvlan. We can probably handle it in similar ways to
how we handle hardware support for vlan.
At that stage, macvtap would be the logical interface for connecting a
VMDq (hardware macvlan) device to a guest!

> > +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> > +                                unsigned long count, loff_t pos)
> > +{
> > +        struct file *file = iocb->ki_filp;
> > +        struct mp_struct *mp = mp_get(file->private_data);
> > +        struct sock *sk = mp->socket.sk;
> > +        struct sk_buff *skb;
> > +        int len, err;
> > +        ssize_t result;
> >
> > Can you explain what this function is even there for? AFAICT,
> > vhost-net doesn't call it, the interface is incompatible with the
> > existing tap interface, and you don't provide a read function.
>
> qemu needs the ability to inject raw packets into the device from
> userspace, bypassing vhost/virtio (for live migration).

Ok, but since there is only a write callback and no read, it won't
actually be able to do this with the current code, right?

Moreover, it seems weird to have a new type of interface here that
duplicates tap/macvtap with less functionality. Coming back to your
original comment, this means that while mpassthru is currently not
duplicating the actual code from macvtap, it would need to do exactly
that to get the qemu interface right!

	Arnd

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > qemu needs the ability to inject raw packets into the device from
> > > userspace, bypassing vhost/virtio (for live migration).
> >
> > Ok, but since there is only a write callback and no read, it won't
> > actually be able to do this with the current code, right?
>
> I think it'll work as is: with vhost, qemu only ever writes, never
> reads from the device. We'll also never need GSO etc., which is a
> large part of what tap does (and macvtap will have to do).

Ah, I see. I didn't realize that qemu needs to write to the device even
if vhost is used. But for the case of migration to another machine
without vhost, wouldn't qemu also need to read?

> > Moreover, it seems weird to have a new type of interface here that
> > duplicates tap/macvtap with less functionality. Coming back to your
> > original comment, this means that while mpassthru is currently not
> > duplicating the actual code from macvtap, it would need to do
> > exactly that to get the qemu interface right!
>
> I don't think so, see above. Anyway, both can reuse tun.c :)

There is one significant difference between macvtap/mpassthru and
tun/tap in that the directions are reversed: while macvtap and
mpassthru forward data from write() into dev_queue_xmit and from
skb_receive into read(), tun/tap forwards data from write() into
skb_receive and from start_xmit into read().

Also, I'm not really objecting to duplicating code between macvtap and
mpassthru, as the implementation can always be merged. My main
objection is instead to having two different _user_interfaces_ for
doing the same thing.

	Arnd
Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Wednesday 14 April 2010 22:31:42 Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 06:35:57PM +0200, Arnd Bergmann wrote:
> > On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > > > qemu needs the ability to inject raw packets into the device
> > > > > from userspace, bypassing vhost/virtio (for live migration).
> > > >
> > > > Ok, but since there is only a write callback and no read, it
> > > > won't actually be able to do this with the current code, right?
> > >
> > > I think it'll work as is: with vhost, qemu only ever writes,
> > > never reads from the device. We'll also never need GSO etc.,
> > > which is a large part of what tap does (and macvtap will have
> > > to do).
> >
> > Ah, I see. I didn't realize that qemu needs to write to the device
> > even if vhost is used. But for the case of migration to another
> > machine without vhost, wouldn't qemu also need to read?
>
> Not that I know. Why?

Well, if the guest not only wants to send data but also receive frames
coming from other machines, they need to get from the kernel into qemu,
and the only way I can see for doing that is to read from this device
if there is no vhost support around on the new machine. Maybe we're
talking about different things here.

	Arnd
Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Wednesday 14 April 2010 22:40:03 Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 10:39:49PM +0200, Arnd Bergmann wrote:
> > Well, if the guest not only wants to send data but also receive
> > frames coming from other machines, they need to get from the kernel
> > into qemu, and the only way I can see for doing that is to read
> > from this device if there is no vhost support around on the new
> > machine. Maybe we're talking about different things here.
>
> mpassthru is currently useless without vhost. If the new machine has
> no vhost, it can't use mpassthru :)

Ok. Is that a planned feature though? vhost is currently limited to
guests with a virtio-net driver, and even if you extend it to other
guest emulations, it will probably always be a subset of the
qemu-supported drivers, but it may be useful to support zero-copy on
other drivers as well.

	Arnd
Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
On Thursday 15 April 2010, Xin, Xiaohui wrote:
> > It seems that you are duplicating a lot of functionality that is
> > already in macvtap. I've asked about this before but then didn't
> > look at your newer versions. Can you explain the value of
> > introducing another interface to user land?
> >
> > I'm still planning to add zero-copy support to macvtap, hopefully
> > reusing parts of your code, but do you think there is value in
> > having both?
>
> I have not looked into your macvtap code in detail before. Are the
> two interfaces exactly the same? We just want to create a simple way
> to do zero-copy. Now it can only support vhost, but in future we also
> want it to support direct read/write operations from user space too.

Right now, the features are mostly distinct. Macvtap first of all
provides a tap-style interface for users, and can also be used by
vhost-net. It also provides a way to share a NIC among a number of
guests in software, though I intend to add support for VMDq and SR-IOV
as well. Zero-copy is not yet done in macvtap but should be added.

mpassthru right now does not allow sharing a NIC between guests, and
does not have a tap interface for non-vhost operation, but does the
zero-copy that is missing from macvtap.

> Basically, compared to the interface, I'm more worried about the
> modifications to the net core we have made to implement zero-copy.
> If this hardest part can be done, then any user-space interface
> modifications or integrations are more easily done after that.

I agree that the network stack modifications are the hard part of
zero-copy, and your work on that looks very promising and is
complementary to what I've done with macvtap. Your current user
interface looks good for testing this out, but I think we should not
merge it (the interface) upstream if we can get the same or better
result by integrating your buffer management code into macvtap. I can
try to merge your code into macvtap myself if you agree, so you can
focus on getting the internals right.
> > Not sure what I'm missing, but who calls the vq->receiver? This
> > seems to be neither in the upstream version of vhost nor introduced
> > by your patch.
>
> See patch v3 2/3 I have sent out; it is called by handle_rx() in
> vhost.

Ok, I see. As a general rule, it's preferred to split a patch series in
a way that makes it possible to apply each patch separately and still
get a working kernel, ideally with more features than the version
before the patch. I believe you could get there by reordering your
patches to make the actual driver the last one in the series. Not a big
problem though, I was mostly looking in the wrong place.

> > > +        ifr.ifr_name[IFNAMSIZ-1] = '\0';
> > > +
> > > +        ret = -EBUSY;
> > > +
> > > +        if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
> > > +                break;
> >
> > Your current use of the IFF_MPASSTHRU* flags does not seem to make
> > any sense whatsoever. You check that this flag is never set, but
> > set it later yourself and then ignore all flags.
>
> Using that flag is to prevent another one from binding the same
> device again. But I will see if it really ignores all other flags.

The ifr variable is on the stack of the mp_chr_ioctl function, and you
never look at the value after setting it. In order to prevent multiple
opens of that device, you probably need to lock out any other users as
well, and make it a property of the underlying device. E.g. you also
want to prevent users on the host from setting an IP address on the
NIC and using it to send and receive data there.

	Arnd
Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps
On Friday 23 April 2010, Avi Kivity wrote:
> On 04/23/2010 01:20 PM, Alexander Graf wrote:
> > I would say the reason is that if we did not convert the user-space
> > pointer to a void *, kvm_get_dirty_log() would end up copying the
> > dirty log to (log->dirty_bitmap << 32) | 0x...
>
> Well yes, that was the problem. If we always set the __u64 value to
> the pointer we're safe though:
>
>         union {
>                 void *p;
>                 __u64 q;
>         };
>
>         void x(void *r)
>         {
>                 /* breaks: */
>                 p = r;
>                 /* works: */
>                 q = (ulong)r;
>         }
>
> In that case it's better to avoid p altogether, since users will
> naturally assign to the pointer.

Right.

> Using a 64-bit integer avoids the problem (though perhaps not
> sufficient for s390, Arnd?)

When there is only a __u64 for the address, it will work on s390 as
well; gcc is smart enough to clear the upper bit on a cast from long
to pointer.

The simple rule is to never put any 'long' or pointer into data
structures that you pass to an ioctl, and to add padding to multiples
of 64 bits to align the data structure for the x86 alignment problem.

	Arnd
Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps
On Friday 23 April 2010, Avi Kivity wrote:
> On 04/23/2010 03:27 PM, Arnd Bergmann wrote:
> > When there is only a __u64 for the address, it will work on s390 as
> > well; gcc is smart enough to clear the upper bit on a cast from
> > long to pointer.
>
> Ah, much more convenient than compat_ioctl. I assume it is part of
> the ABI, not a gcc-ism?

I don't think it's part of the ABI, but it's required to guarantee
that code like this works:

        int compare_pointer(void *a, void *b)
        {
                unsigned long ai = (unsigned long)a, bi = (unsigned long)b;

                return ai == bi; /* true if a and b point to the same object */
        }

We certainly rely on this already.

	Arnd
Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps
On Friday 23 April 2010, Avi Kivity wrote:
> Ah, so the 31st bit is optional as far as userspace is concerned?
> What does it mean? (just curious)

On data pointers it's ignored. When you branch to a function, this bit
determines whether the target function is run in 24 or 31 bit mode.
This allows linking to legacy code on older operating systems that
also support 24 bit libraries.

> What happens on the opposite conversion? Is it restored? What about
>
>         int compare_pointer(void *a, void *b)
>         {
>                 unsigned long ai = (unsigned long)a;
>                 void *aia = (void *)ai;
>
>                 return a == b; /* true if a and b point to the same object */
>         }

Some instructions set the bit, others clear it, so aia and a may not
be bitwise identical.

> Does gcc mask the bit in pointer comparisons as well?

Yes. To stay in the above example:

        a == aia;                                     /* true */
        (unsigned long)a == (unsigned long)aia;       /* true */
        *(unsigned long *)a == *(unsigned long *)aia; /* undefined on s390 */

	Arnd
Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro
On Tuesday 04 May 2010, Takuya Yoshikawa wrote:
> Although we can use *_le_bit() helpers to treat bitmaps as
> little-endian arranged, having the le bit offset calculation as a
> separate macro gives us more freedom.
>
> For example, KVM has le arranged dirty bitmaps for VGA and
> live-migration, and they are used in user space too. To avoid bitmap
> copies between kernel and user space, we want to update the bitmaps
> in user space directly. To achieve this, the le bit offset with
> *_user() functions helps us a lot.
>
> So let us use the le bit offset calculation part by defining it as a
> new macro: generic_le_bit_offset().

Does this work correctly if your user space is 32 bits (i.e. unsigned
long is a different size in user space and kernel) on both big- and
little-endian systems?

I'm not sure about all the details, but I think you cannot in general
share bitmaps between user space and kernel because of this.

	Arnd
Re: [RFC][PATCH resend 8/12] asm-generic: bitops: introduce le bit offset macro
On Monday 10 May 2010, Takuya Yoshikawa wrote:
> (2010/05/06 22:38), Arnd Bergmann wrote:
> > On Wednesday 05 May 2010, Takuya Yoshikawa wrote:
> > > There was a suggestion to propose set_le_bit_user() kind of
> > > macros. But what I thought was that these have the constraint you
> > > two explained and seemed to be a little bit specific to some
> > > area, like KVM. So I decided to propose just the offset
> > > calculation macro.
> >
> > I'm not sure I understand how this macro is going to be used,
> > though. If you are just using this in kernel space, that's fine,
> > please go for it.
>
> Yes, I'm just using it in kernel space: qemu has its own
> endian-related helpers. So if you allow us to place this macro in
> asm-generic/bitops/*, it will help us.

No problem at all then. Thanks for the explanation.

Acked-by: Arnd Bergmann <a...@arndb.de>
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Saturday 29 May 2010, Tom Lyon wrote:
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + * buffer may only be larger than 1 page if (a) there is
> + * an iommu in the system, or (b) buffer is part of a huge page
> + */
> +struct vfio_dma_map {
> +        __u64 vaddr;   /* process virtual addr */
> +        __u64 dmaaddr; /* desired and/or returned dma address */
> +        __u64 size;    /* size in bytes */
> +        int   rdwr;    /* bool: 0 for r/o; 1 for r/w */
> +};

Please add a 32-bit padding word at the end of this, otherwise the
size of the data structure is incompatible between 32-bit x86
applications and 64-bit kernels.

	Arnd
Re: vhost-net todo list
On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> vhost-net driver projects

I still think that list should include:

- UDP multicast socket support
- TCP socket support
- raw packet socket support for qemu (from Or Gerlitz)

If we have those, plus the tap support that is already on your list,
we can use vhost-net as a generic offload for the host networking in
qemu.

> projects involving the networking stack
>
> - export socket from tap so vhost can use it - working on it now
> - extend raw sockets to support GSO/checksum offloading, and teach
>   vhost to use that capability [one way to do this: virtio net
>   header support]; will allow working with e.g. macvlan

One thing I'm planning to work on is bridge support in macvlan,
together with VEPA-compliant operation, i.e. not sending back
multicast frames to the origin. I'll also keep looking into macvtap,
though that will be less important once you get the tap socket support
running.

	Arnd
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> Userspace on x86 maps a PCI region, uses it for communication with
> ppc? This might have portability issues.

On x86 it should work, but if the host is powerpc or similar, you
cannot reliably access PCI I/O memory through copy_tofrom_user but
have to use memcpy_toio/fromio or readl/writel calls, which don't work
on user pointers. Specifically on powerpc, copy_from_user cannot
access unaligned buffers if they are on an I/O mapping.

	Arnd
Re: vhost-net todo list
On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2009 at 04:52:40PM +0200, Arnd Bergmann wrote:
> > On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> > > vhost-net driver projects
> >
> > I still think that list should include
>
> Yea, why not. Go wild.
>
> > - UDP multicast socket support
> > - TCP socket support
>
> Switch to UDP unicast while we are at it? Tunneling raw packets over
> TCP looks wrong.

Well, TCP is what qemu supports right now, that's why I added it to
the list. We could add UDP unicast as yet another protocol in both
qemu and vhost-net if there is demand for it. The implementation
should be trivial based on the existing code paths.

> > One thing I'm planning to work on is bridge support in macvlan,
> > together with VEPA-compliant operation, i.e. not sending back
> > multicast frames to the origin.
>
> Is multicast filtering already there (i.e. only getting frames for
> groups you want)?

No, and I think this is less important, because the bridge code also
doesn't do this.

> > I'll also keep looking into macvtap, though that will be less
> > important once you get the tap socket support running.
>
> Not sure I see the connection. To get an equivalent to macvtap, what
> you need is tso etc. support in packet sockets, no?

I'm not worried about tso support here. One of the problems that raw
packet sockets have is the requirement for root permissions (e.g.
through libvirt). Tap sockets and macvtap both don't have this
limitation, so you can use them as a regular user without libvirt.

	Arnd
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote:
> > On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> > > Userspace on x86 maps a PCI region, uses it for communication
> > > with ppc? This might have portability issues.
> >
> > On x86 it should work, but if the host is powerpc or similar, you
> > cannot reliably access PCI I/O memory through copy_tofrom_user but
> > have to use memcpy_toio/fromio or readl/writel calls, which don't
> > work on user pointers. Specifically on powerpc, copy_from_user
> > cannot access unaligned buffers if they are on an I/O mapping.
>
> We are talking about doing this in userspace, not in kernel.

Ok, that's fine then. I thought the idea was to use the vhost_net
driver to access the user memory, which would be a really cute hack
otherwise, as you'd only need to provide the eventfds from a
hardware-specific driver and could use the regular virtio_net on the
other side.

	Arnd
Re: vhost-net todo list
On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> > No, I think this is less important, because the bridge code also
> > doesn't do this.
>
> True, but the reason might be that it is much harder in a bridge (you
> have to snoop multicast registrations). With macvlan you know which
> multicasts each device wants.

Right. It shouldn't be hard to do, and I'll probably get to that after
the other changes.

> > One of the problems that raw packet sockets have is the requirement
> > for root permissions (e.g. through libvirt). Tap sockets and
> > macvtap both don't have this limitation, so you can use them as a
> > regular user without libvirt.
>
> I don't see a huge difference here. If you are happy with the user
> being able to bypass filters in the host, just give her the
> CAP_NET_RAW capability. It does not have to be root.

Capabilities are nice in theory, but I've never seen them being used
effectively in practice, where it essentially comes down to some SUID
wrapper.

Also, I might not want to allow the user to open a random raw socket,
but only one on a specific downstream port of a macvlan interface, so
I can filter out the data from that respective MAC address in an
external switch. That scenario is probably not so relevant for KVM,
unless you consider the guest taking over the qemu host process a
valid security threat.

	Arnd
Re: vhost-net todo list
On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> > Also, I might not want to allow the user to open a random raw
> > socket, but only one on a specific downstream port of a macvlan
> > interface, so I can filter out the data from that respective MAC
> > address in an external switch.
>
> I agree. Maybe we can fix that for raw sockets; want me to add it to
> the list? :)

So far, I could not find any theoretical solution for how to fix this,
but if you think it can be done, it would be good to have it on the
list somewhere.

	Arnd
Re: vhost-net todo list
On Thursday 17 September 2009, Michael S. Tsirkin wrote:
> On Thu, Sep 17, 2009 at 01:30:00PM +0200, Arnd Bergmann wrote:
> > On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> > > > Also, I might not want to allow the user to open a random raw
> > > > socket, but only one on a specific downstream port of a macvlan
> > > > interface, so I can filter out the data from that respective
> > > > MAC address in an external switch.
> > >
> > > I agree. Maybe we can fix that for raw sockets; want me to add it
> > > to the list? :)
> >
> > So far, I could not find any theoretical solution for how to fix
> > this.
>
> What if the socket had a LOCKBIND ioctl after which you can not bind
> it to any other device? Then someone with the RAW capability can open
> the socket, bind it to a device and hand it to you. You can send
> packets but not switch to another device.

Could work, though I was hoping for a solution that does not depend on
a privileged task at run time to open the socket, as you have with
persistent tap devices or chardevs like macvtap that can have their
permissions set by udev.

	Arnd
Re: vhost-net todo list
On Thursday 17 September 2009, Michael S. Tsirkin wrote:
> Well, we could have a char device with an ioctl that gives you back a
> socket, or maybe even have it give you back a socket when you open
> it. Will that make you happy?

Well, that would put us in the exact same spot as the tun/tap driver
patch you're working on or my (still unfinished, I need to spend some
time on it again) macvtap driver. As I said, either one addresses the
problem but is unrelated to the raw socket interface.

	Arnd
Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
On Tuesday 22 September 2009, Michael S. Tsirkin wrote:
> > More importantly, when virtualization is used with multi-queue
> > NICs, the virtio-net NIC is a single-CPU bottleneck. The virtio-net
> > NIC should preserve the parallelism (lock free) using multiple
> > receive/transmit queues. The number of queues should equal the
> > number of CPUs.
>
> Yup, multiqueue virtio is on the todo list ;-) Note we'll need
> multiqueue tap for that to help.

My idea for that was to open multiple file descriptors to the same
macvtap device and let the kernel figure out the right thing to do
with that. You can do the same with raw packet sockets in the case of
vhost_net, but I wouldn't want to add more complexity to the tun/tap
driver for this.

	Arnd
Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
On Tuesday 22 September 2009, Stephen Hemminger wrote: My idea for that was to open multiple file descriptors to the same macvtap device and let the kernel figure out the right thing to do with that. You can do the same with raw packet sockets in the case of vhost_net, but I wouldn't want to add more complexity to the tun/tap driver for this. Or get tap out of the way entirely. The packets should not have to go out to user space at all (see veth) How does veth relate to that, do you mean vhost_net? With vhost_net, you could still open multiple sockets, only the access is in the kernel. Obviously, once it is all in the kernel, that could be done under the covers, but I think it would be cleaner to treat vhost_net purely as a way to bypass the syscalls for user space, with as little visible impact as possible otherwise. Arnd
Re: [Qemu-devel] Release plan for 0.12.0
On Thursday 08 October 2009, Anthony Liguori wrote: Jens Osterkamp wrote: On Wednesday 30 September 2009, Anthony Liguori wrote: Please add to this list and I'll collect it all and post it somewhere. What about Or Gerlitz' raw backend driver? I did not see it go in yet, or did I miss something? The patch seems to have not been updated after the initial posting and the first feedback cycle. I'm generally inclined to oppose the functionality as I don't think it offers any advantages over the existing backends. There are two reasons why I think this backend is important: - As an easy way to provide isolation between guests (private ethernet port aggregator, PEPA) and external enforcement of network privileges (virtual ethernet port aggregator, VEPA) using the macvlan subsystem. - As a counterpart to the vhost_net driver, providing an identical user interface with or without vhost_net acceleration in the kernel. Arnd
Re: [PATCH 00/27] Add KVM support for Book3s_64 (PPC64) hosts v5
On Wednesday 21 October 2009, Alexander Graf wrote: KVM for PowerPC only supports embedded cores at the moment. While it makes sense to virtualize on small machines, it's even more fun to do so on big boxes. So I figured we need KVM for PowerPC64 as well. This patchset implements KVM support for Book3s_64 hosts and guest support for Book3s_64 and G3/G4. To really make use of this, you also need a recent version of qemu. Don't want to apply patches? Get the git tree! $ git clone git://csgraf.de/kvm $ git checkout origin/ppc-v4 Whole series Acked-by: Arnd Bergmann a...@arndb.de Great work, Alex! Arnd
Re: vhost-net patches
On Monday 26 October 2009, Shirley Ma wrote: On Sun, 2009-10-25 at 11:11 +0200, Michael S. Tsirkin wrote: What is vnet0? That's a tap interface. I am binding a raw socket to a tap interface and it doesn't work. Is that supported? Is the tap device connected to a bridge as you'd normally do with qemu? That won't work because then the data you send to the socket will be queued at the /dev/tun chardev. You can probably connect it like this: qemu <-> vhost_net <-> vnet0 == /dev/tun <-> qemu To connect two guests. I've also used a bidirectional pipe before, to connect two tap interfaces to each other. However, if you want to connect to a bridge, the easier interface would be to use a veth pair, with one end on the bridge and the other end used for the packet socket. Arnd
Re: [PATCHv6 1/3] tun: export underlying socket
On Monday 02 November 2009, Michael S. Tsirkin wrote: Tun device looks similar to a packet socket in that both pass complete frames from/to userspace. This patch fills in enough fields in the socket underlying tun driver to support sendmsg/recvmsg operations, and message flags MSG_TRUNC and MSG_DONTWAIT, and exports access to this socket to modules. Regular read/write behaviour is unchanged. This way, code using raw sockets to inject packets into a physical device can support injecting packets into the host network stack almost without modification. First user of this interface will be the vhost virtualization accelerator. You mentioned before that you wanted to export the socket using some ioctl function returning an open file descriptor, which seemed to be a cleaner approach than this one. What was your reason for changing? index 3f5fd52..404abe0 100644 --- a/include/linux/if_tun.h +++ b/include/linux/if_tun.h @@ -86,4 +86,18 @@ struct tun_filter { __u8 addr[0][ETH_ALEN]; }; +#ifdef __KERNEL__ +#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE) +struct socket *tun_get_socket(struct file *); +#else +#include <linux/err.h> +#include <linux/errno.h> +struct file; +struct socket; +static inline struct socket *tun_get_socket(struct file *f) +{ + return ERR_PTR(-EINVAL); +} +#endif /* CONFIG_TUN */ +#endif /* __KERNEL__ */ #endif /* __IF_TUN_H */ Is this a leftover from testing? Exporting the function for !__KERNEL__ seems pointless. Arnd
Re: [PATCH 14/27] Add book3s_64 specific opcode emulation
On Tuesday 03 November 2009, Benjamin Herrenschmidt wrote: (Though glibc can be nasty, afaik it might load up optimized variants of some routines with hard wired cache line sizes based on the CPU type) You can also get applications with hand-coded cache optimizations that are even harder, if not impossible, to fix. Arnd
Re: [PATCHv6 1/3] tun: export underlying socket
On Tuesday 03 November 2009, Arnd Bergmann wrote: index 3f5fd52..404abe0 100644 --- a/include/linux/if_tun.h +++ b/include/linux/if_tun.h @@ -86,4 +86,18 @@ struct tun_filter { __u8 addr[0][ETH_ALEN]; }; +#ifdef __KERNEL__ +#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE) +struct socket *tun_get_socket(struct file *); +#else +#include <linux/err.h> +#include <linux/errno.h> +struct file; +struct socket; +static inline struct socket *tun_get_socket(struct file *f) +{ + return ERR_PTR(-EINVAL); +} +#endif /* CONFIG_TUN */ +#endif /* __KERNEL__ */ #endif /* __IF_TUN_H */ Is this a leftover from testing? Exporting the function for !__KERNEL__ seems pointless. Michael, you didn't reply to this comment and the code is still there in v8. Do you actually need this? What for? Arnd
Re: Installing kernel headers in kvm-kmod
On Thursday 10 December 2009, Avi Kivity wrote: Maybe even /usr/local/include/kvm-kmod-$version/, and a symlink /usr/local/include/kvm-kmod. Depends on how fine-grained you want to do the packaging. Most distributions split packages between code and development packages. The kvm-kmod code is the kernel module, so you want to be able to install it for multiple kernels simultaneously. Building the package only requires one version of the header and does not depend on the underlying kernel version, only on the version of the module, so it's reasonable to install only one version as the -dev package, and have a dependency in there to match the module version with the header version. The most complex setup would split the development package into one per kernel version and/or module version, plus an extra package for the module version containing only the symlink. I wouldn't go there. It may also be useful to do the equivalent of 'make headers_install' from the kernel, to remove all #ifdef __KERNEL__ sections and sparse annotations from the header files, but it should also work without that. Well, qemu.git needs __user removed. This one is taken care of by kvm_kmod in the sync script, though it would be cleaner to only do it for the installed version of the header, not for the one used to build kvm.ko. Arnd
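[Editor's note] A minimal sketch of the annotation stripping discussed above — removing sparse annotations such as __user from an exported header. The real kernel does this in scripts/headers_install.sh (together with unifdef for the #ifdef __KERNEL__ sections); this one-function version is only illustrative:

```python
import re

# Strip sparse annotations (__user, __force, __kernel) the way an
# installed/exported header would have them removed. Illustrative only;
# the kernel's headers_install also runs unifdef -U__KERNEL__.
def strip_annotations(header: str) -> str:
    return re.sub(r"__(user|force|kernel)\s*", "", header)

print(strip_annotations("int kvm_ioctl(void __user *arg);"))
# -> int kvm_ioctl(void *arg);
```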
Re: Installing kernel headers in kvm-kmod
On Thursday 10 December 2009 17:14:40 Jan Kiszka wrote: Avi Kivity wrote: On 12/10/2009 06:42 PM, Jan Kiszka wrote: I've just (forced-)pushed the simple version with /usr/include/kvm-kmod as destination. The user headers are now stored under usr/include in the kvm-kmod sources and installed from there. It's customary to install to /usr/local, not to /usr (qemu does the same). Right. Specifically, an install from source should go to /usr/local/include by default, while a distro package should override the path to go to /usr/include, which the current version easily allows. This also means that qemu will have to look in three places now, /usr/local/include/kvm-kmod, /usr/include/kvm-kmod and /usr/include. Adding /usr/local/include probably doesn't hurt but should not be necessary. Adjusted accordingly. Moreover, I only install the target arch's header now. Looks good now. Arnd
Re: Host-guest channel interface advice needed
On Wednesday 26 November 2008, Gleb Natapov wrote: The interfaces that are being considered are a netlink socket (only datagram semantics, linux specific), a new socket family or a character device with a different minor number for each channel. Which one better suits the purpose? Is there another kind of interface to consider? A new socket family looks like a good choice, but it would be nice to hear other opinions before starting to work on it. I think a socket and a pty both look reasonable here, but one important aspect IMHO is that you only need a new kernel driver for the guest if you just use the regular pty support or Unix domain sockets in the host. Obviously, there needs to be some control over permissions, as a guest must not be able to just open any socket or pty of the host, so a reasonable approach might be that the guest can only create a socket or pty that can be opened by the host, but not vice versa. Alternatively, you create the socket/pty in host userspace and then allow passing that down into the guest, which creates a virtio device from it. Arnd
Re: Does macvtap support host to guest communication?
On Monday 18 April 2011, Asias He wrote: Hi, folks I am trying to use qemu/qemu-kvm with macvtap using the following commands: # ip link add link eth0 name v0 type macvtap mode {vepa,bridge,private} # ip link set v0 address da:4e:17:88:42:b1 up # idx=`ip link show v0 | grep mtu | awk -F: '{print $1}'` # kvm -net nic,macaddr=da:4e:17:88:42:b1 -net tap,fd=3 -hda /home/asias/qemu-stuff/sid.img 3<>/dev/tap${idx} I found that the guest can access other hosts on the LAN except the host where the guest lives, and the host where the guest lives can not access the guest. My question is: Does macvtap support host (hypervisor host) to guest communication? You can communicate between macvtap and macvlan devices when they are in bridge mode, but these devices cannot communicate with clients that run on the underlying device. Just add a macvlan device to your hardware interface and use that in the host instead of running on the low-level device directly. The other option is to use a vepa enabled bridge, but these are relatively rare. Arnd
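[Editor's note] The `idx=` step above scrapes the interface index out of `ip link show` output so the matching /dev/tapN chardev can be opened. A sketch of that parsing step in Python (the sample output line is invented for illustration; the real format may include extra fields):

```python
# Extract the interface index from an `ip link show` line such as
# "8: v0@eth0: <BROADCAST,...> mtu 1500 ..." -- the index is the number
# before the first colon, and macvtap exposes its chardev as /dev/tap<index>.
def tap_chardev(ip_link_line: str) -> str:
    idx = int(ip_link_line.split(":", 1)[0])
    return f"/dev/tap{idx}"

line = "8: v0@eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue"
print(tap_chardev(line))  # -> /dev/tap8
```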
Re: Does macvtap support host to guest communication?
On Monday 18 April 2011, Asias He wrote: (1) Is it possible to add an interface to macvtap like /dev/net/tun, e.g., /dev/net/macvtap? Currently, it is hard to use macvtap programmatically. I decided against having a multiplexor device because it makes permission handling rather hard. One chardev per network interface makes it possible to handle permissions in multiuser setups. (2) Adding another macvlan device (e.g., macvlan0) to the hardware interface (e.g., eth0) and using it as the old eth0 makes the process of using macvtap complicated. One has to reconfigure the network. This is not optimal from the user perspective. Is it possible to leave the low-level device as is when using the macvtap device? Only in VEPA mode. Note that a similar restriction applies when using the bridge device, for the same technical reasons. Arnd
Re: Does macvtap support host to guest communication?
On Monday 18 April 2011, Ingo Molnar wrote: Only in VEPA mode. Note that a similar restriction applies when using the bridge device, for the same technical reasons. Just to sum things up, our goal is to allow the tools/kvm/ unprivileged tool to provide TCP connectivity to Linux guests transparently, with the following parameters: - the kvm tool runs unprivileged - as an ordinary user - without having to configure much (preferably zero configuration: without having to configure anything) on the guest Linux side - multiple guests should just work without interfering with each other - the kvm tool wants to be stateless - i.e. it does not want to allocate or manage host side devices - it just wants to provide the kind of TCP/IP connectivity host unprivileged user-space has, to the guest. The tool wants to be a generic tool with no global state, not a daemon. So it wants to be a stateless, unprivileged and zero-configuration solution. Is this possible with macvtap, and if yes, what kind of macvtap mode and usage would you recommend for that goal? With the above requirements, I would suggest using something like the qemu user networking. This is slower and does not allow servers to be present in the guest, but those are not your goals as it seems. The primary goals of macvtap are to allow efficient networking (zero-copy, multi-queue, although we're not completely there yet) and proper security abstractions. If you want a guest to appear on the same network as the host, you can not do that without privileges to manage the host network setup, and I guess that will have to stay that way. Arnd
Re: Does macvtap support host to guest communication?
On Monday 18 April 2011, Asias He wrote: We do need guest appearing on the same network as the host support as well. The reason I am considering using macvtap instead of tap plus brctl is that it simplifies the bridge configuration and it is more efficient. Right, you certainly don't need to consider tap/brctl any more. However, IMHO, the interface of macvtap is not user friendly, at least for me. I have no idea about the technical reasons that make the low-level device inaccessible. But if it were accessible, a lot of configuration could be eliminated. I know virtualbox's bridge mode has this kind of restriction, while VMware's bridge mode does not. The main reason is that having a MAC address scan in the regular networking core would make the common TX case where there is no macvlan device more complex. Macvtap is derived from the plain macvlan driver, which used to support only sending out to the wire until I added the optional bridge mode. If you want a regular device to be able to send to a macvlan port, that would require at least these changes: * Add an option to put a plain device into macvlan-bridge mode * Add support for that option into iproute2 * Add a hook into dev_queue_xmit() to check for macvlan ports Arnd
Re: Does macvtap support host to guest communication?
On Monday 18 April 2011, Asias He wrote: If you want a regular device to be able to send to a macvlan port, that would require at least these changes: * Add an option to put a plain device into macvlan-bridge mode * Add support for that option into iproute2 * Add a hook into dev_queue_xmit() to check for macvlan ports Cool! Arnd, mind to add this feature to macvtap? No, not after I just explained why I haven't done it before and why it's so controversial. Also, I have moved on to other projects and am no longer doing active development of the macvtap driver. I'd be happy to pass on the ownership to someone else and help him or her extend it. Arnd
Re: [PATCH V2] VFIO driver: Non-privileged user level PCI drivers
On Tuesday 08 June 2010, Randy Dunlap wrote: Documentation/ioctl/ioctl-number.txt |1 Documentation/vfio.txt | 177 +++ MAINTAINERS |7 drivers/Kconfig |2 drivers/Makefile |1 drivers/vfio/Kconfig | 18 drivers/vfio/Makefile|6 drivers/vfio/uiommu.c| 126 + drivers/vfio/vfio_dma.c | 324 drivers/vfio/vfio_intrs.c| 191 +++ drivers/vfio/vfio_main.c | 624 + drivers/vfio/vfio_pci_config.c | 554 ++ drivers/vfio/vfio_rdwr.c | 147 + drivers/vfio/vfio_sysfs.c| 153 ++ include/linux/uiommu.h | 62 ++ include/linux/vfio.h | 200 16 files changed, 2593 insertions(+) This seems to be missing a change to include/linux/Kbuild that adds vfio.h to the exported files. Without the export, you cannot use the definitions from user space programs unless they come with their own copy of the header. Arnd
Re: [Qemu-devel] Re: KVM call minutes for June 15
On Wednesday 16 June 2010, Markus Armbruster wrote: Can't hurt reviewer motivation. Could it be automated? Find replies, extract tags. If you want your acks to be picked up, you better make sure your References header works, and your tags are formatted correctly. I think pwclient (https://patchwork.kernel.org/) can do this for you. Arnd
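[Editor's note] The automation discussed here — scanning reply bodies for correctly formatted review tags — amounts to a line-anchored regex over each mail. A sketch (the sample reply text and names are invented; the tag format is the usual kernel "Acked-by:" convention):

```python
import re

# Match the standard review tags at the start of a line in a reply body,
# which is what patchwork-style tooling looks for.
TAG_RE = re.compile(r"^(Acked-by|Reviewed-by|Tested-by):\s*(.+)$", re.MULTILINE)

reply = """Looks good to me.

Reviewed-by: Jane Doe <jane@example.com>
Acked-by: John Smith <john@example.com>
"""

tags = TAG_RE.findall(reply)
print(tags)
# -> [('Reviewed-by', 'Jane Doe <jane@example.com>'), ('Acked-by', 'John Smith <john@example.com>')]
```

A tag buried mid-sentence would not match, which is exactly the point of requiring correct formatting.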
Re: kvm-s390: Dont exit SIE on SIGP sense running
On Monday 21 June 2010, Christian Borntraeger wrote: Hmm, don't know. Currently this calls into a s390 debug tracing facility (arch/s390/kernel/debug.c) which is heavily used by our service folks. There are commands for crash and lcrash to show these s390 debug traces from a dump. Maybe it's worth investigating whether we should change some of these events to have both ftrace tracepoints and the debug traces. I think that it would be worthwhile to convert the entire s390 debug code to become tracepoints, either one by one or making it a subclass with the existing interfaces. Arnd
Re: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
On Friday 30 July 2010 17:51:52 Shirley Ma wrote: On Fri, 2010-07-30 at 16:53 +0800, Xin, Xiaohui wrote: Since vhost-net already supports macvtap/tun backends, do you think it's better to implement zero copy in macvtap/tun than introducing a new media passthrough device here? I'm not sure if there will be more duplicated code in the kernel. I think there should be less duplicated code in the kernel if we use macvtap to support what the media passthrough driver does here. Since macvtap already supports the virtio_net header and offloading, the only missing functionality is zero copy. Also QEMU supports macvtap, we just need to add a zero copy flag as an option. Yes, I fully agree and that was one of the intended directions for macvtap to start with. Thank you so much for following up on that, I've long been planning to work on macvtap zero-copy myself but it's now lower on my priorities, so it's good to hear that you made progress on it, even if there are still performance issues. Arnd
Re: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
On Wednesday 04 August 2010, Dong, Eddie wrote: Arnd Bergmann wrote: On Friday 30 July 2010 17:51:52 Shirley Ma wrote: I think there should be less duplicated code in the kernel if we use macvtap to support what the media passthrough driver does here. Since macvtap already supports the virtio_net header and offloading, the only missing functionality is zero copy. Also QEMU supports macvtap, we just need to add a zero copy flag as an option. Yes, I fully agree and that was one of the intended directions for macvtap to start with. Thank you so much for following up on that, I've long been planning to work on macvtap zero-copy myself but it's now lower on my priorities, so it's good to hear that you made progress on it, even if there are still performance issues. But zero-copy is a generic Linux feature that can be used by other VMMs as well, if the BE service drivers want to incorporate it. If we can make the mp device VMM-agnostic (it may not be yet in the current patch), that will help Linux more. But the tun/tap protocol is what most hypervisors use today on Linux, and one of the design goals of macvtap was to keep that interface so that everyone gets features like zero-copy if that is added to macvtap. The mp device interface is currently not supported by anything other than vhost with these patches, and making it more generic would turn the interface into a copy of macvtap. Arnd
Re: [PATCH] Add definitions for current cpu models..
On Monday 18 January 2010, john cooper wrote: +.name = "Conroe", +.level = 2, +.vendor1 = CPUID_VENDOR_INTEL_1, +.vendor2 = CPUID_VENDOR_INTEL_2, +.vendor3 = CPUID_VENDOR_INTEL_3, +.family = 6, /* P6 */ +.model = 2, that looks wrong -- what is model 2 actually? +.stepping = 3, +.features = PPRO_FEATURES | +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | /* note 1 */ +CPUID_PSE36, /* note 2 */ +.ext_features = CPUID_EXT_SSE3 | CPUID_EXT_SSSE3, +.ext2_features = (PPRO_FEATURES & CPUID_EXT2_MASK) | +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, +.ext3_features = CPUID_EXT3_LAHF_LM, +.xlevel = 0x800A, +.model_id = "Intel Celeron_4x0 (Conroe/Merom Class Core 2)", +}, Celeron_4x0 is a rather bad example, because it is based on the single-core Conroe-L, which is family 6 / model 22 unlike all the dual- and quad-core Merom/Conroe that are model 15. +{ +.name = "Penryn", +.level = 2, +.vendor1 = CPUID_VENDOR_INTEL_1, +.vendor2 = CPUID_VENDOR_INTEL_2, +.vendor3 = CPUID_VENDOR_INTEL_3, +.family = 6, /* P6 */ +.model = 2, +.stepping = 3, +.features = PPRO_FEATURES | +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | /* note 1 */ +CPUID_PSE36, /* note 2 */ +.ext_features = CPUID_EXT_SSE3 | +CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41, +.ext2_features = (PPRO_FEATURES & CPUID_EXT2_MASK) | +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, +.ext3_features = CPUID_EXT3_LAHF_LM, +.xlevel = 0x800A, +.model_id = "Intel Core 2 Duo P9xxx (Penryn Class Core 2)", +}, This would be model 23 for Penryn-class Xeon/Core/Pentium/Celeron processors without L3 cache.
+{ +.name = "Nehalem", +.level = 2, +.vendor1 = CPUID_VENDOR_INTEL_1, +.vendor2 = CPUID_VENDOR_INTEL_2, +.vendor3 = CPUID_VENDOR_INTEL_3, +.family = 6, /* P6 */ +.model = 2, +.stepping = 3, +.features = PPRO_FEATURES | +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | /* note 1 */ +CPUID_PSE36, /* note 2 */ +.ext_features = CPUID_EXT_SSE3 | +CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41 | +CPUID_EXT_SSE42 | CPUID_EXT_POPCNT, +.ext2_features = (PPRO_FEATURES & CPUID_EXT2_MASK) | +CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, +.ext3_features = CPUID_EXT3_LAHF_LM, +.xlevel = 0x800A, +.model_id = "Intel Core i7 9xx (Nehalem Class Core i7)", +}, Apparently, not all the i7-9xx CPUs are Nehalem, the i7-980X is supposed to be Westmere, which has more features. Because of the complexity, I'd recommend passing down the *model* number of the emulated CPU, the interesting Intel ones (those supported by KVM) being: 15-6: CedarMill/Presler/Dempsey/Tulsa (Pentium 4/Pentium D/Xeon 50xx/Xeon 71xx) 6-14: Yonah/Sossaman (Celeron M4xx, Core Solo/Duo, Pentium Dual-Core T1000, Xeon ULV) 6-15: Merom/Conroe/Kentsfield/Woodcrest/Clovertown/Tigerton (Celeron M5xx/E1xxx/T1xxx, Pentium T2xxx/T3xxx/E2xxx, Core 2 Solo U2xxx, Core 2 Duo E4xxx/E6xxx/Q6xxx/T5xxx/T7xxx/L7xxx/U7xxx/SP7xxx, Xeon 30xx/32xx/51xx/52xx/72xx/73xx) 6-23: Penryn/Wolfdale/Yorkfield/Harpertown (Celeron 7xx/9xx/SU2xxx/T3xxx/E3xxx, Pentium T4xxx/SU2xxx/SU4xxx/E5xxx/E6xxx, Core 2 Solo SU3xxx, Core 2 Duo P/SU/T6xxx/x8xxx/x9xxx, Xeon 31xx/33xx/52xx/54xx) 6-26: Gainestown/Bloomfield (Xeon 35xx/55xx, Core i7-9xx) 6-28: Atom 6-29: Dunnington (Xeon 74xx) 6-30: Lynnfield/Clarksfield/JasperForest (Xeon 34xx, Core i7-8xx, Core i7-xxxQM, Core i5-7xx) 6-37: Arrandale/Clarkdale (Dual-Core Core i3/i5/i7) 6-44: Gulftown (six-core) Arnd
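[Editor's note] The family/model pairs listed in the mail above amount to a lookup table from CPUID values to microarchitecture names. A sketch with a small, illustrative subset of those entries (Penryn-class parts report family 6 / model 23, as the review comment on the Penryn definition notes):

```python
# Map Intel CPUID (family, model) pairs to the codenames listed in the
# mail; only a few of the entries are reproduced here for illustration.
INTEL_CODENAMES = {
    (15, 6): "CedarMill/Presler/Dempsey/Tulsa",
    (6, 14): "Yonah/Sossaman",
    (6, 15): "Merom/Conroe/Kentsfield/Woodcrest/Clovertown/Tigerton",
    (6, 23): "Penryn/Wolfdale/Yorkfield/Harpertown",
    (6, 26): "Gainestown/Bloomfield",
    (6, 28): "Atom",
    (6, 44): "Gulftown",
}

def codename(family: int, model: int) -> str:
    return INTEL_CODENAMES.get((family, model), "unknown")

print(codename(6, 26))  # -> Gainestown/Bloomfield
```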
Re: [Qemu-devel] Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wednesday 27 January 2010, Anthony Liguori wrote: The raw backend can be attached to a physical device This is equivalent to bridging with tun/tap except that it has the unexpected behaviour of unreliable host/guest networking (which is not universally consistent across platforms either). This is not a mode we want to encourage users to use. It's not the most common scenario, but I've seen systems (I remember one on s/390 with z/VM) where you really want to isolate the guest network as much as possible from the host network. Besides PCI passthrough, giving the host device to a guest using a raw socket is the next best approximation of that. Then again, macvtap will do that too, if the device driver supports multiple unicast MAC addresses without forcing promiscuous mode. , macvlan macvtap is a superior way to achieve this use case because a macvtap fd can safely be given to a lesser privileged process without allowing escalation of privileges. Yes. or SR-IOV VF. This depends on vhost-net. Why? I don't see anything in this scenario that is vhost-net specific. I also plan to cover this aspect in macvtap in the future, but the current code does not do it yet. It also requires device driver changes. In general, what I would like to see for this is something more user friendly that dealt specifically with this use-case. Although honestly, given the recent security concerns around raw sockets, I'm very concerned about supporting raw sockets in qemu at all. Essentially, you get worse security doing vhost-net + raw + VF than with PCI passthrough + VF because at least in the latter case you can run qemu without privileges. CAP_NET_RAW is a very big privilege. It can be contained to a large degree with network namespaces. When you run qemu in its own namespace and add the VF to that, CAP_NET_RAW should ideally have no effect on other parts of the system (except bugs in the namespace implementation).
Arnd
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wednesday 27 January 2010, Michael S. Tsirkin wrote: I am not sure I agree with this sentiment. The main issue being that macvtap doesn't exist on all kernels :). macvlan also requires hardware support, packet socket can work with any network card in promisc mode. To be clear, macvlan does not require hardware support, it will happily put cards into promiscuous mode if they don't support multiple mac addresses. I agree to that. People don't even seem to agree whether it's a raw socket or a packet socket :) We need a better name for this option: what it really does is rely on an external device to loopback a packet to us, so how about -net loopback or -net extbridge? I think -net socket,fd should just be (trivially) extended to work with raw sockets out of the box, with no support for opening it. Then you can have libvirt or some wrapper open a raw socket and a private namespace and just pass it down. If you really want to let qemu open the socket itself, -net socket,raw=eth0 is probably closer to what you want than a new -net xxx option. Arnd
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wednesday 27 January 2010, Anthony Liguori wrote: I think -net socket,fd should just be (trivially) extended to work with raw sockets out of the box, with no support for opening it. Then you can have libvirt or some wrapper open a raw socket and a private namespace and just pass it down. That'd work. Anthony? The fundamental problem that I have with all of this is that we should not be introducing new network backends that are based around something only a developer is going to understand. If I'm a user and I want to use an external switch in VEPA mode, how in the world am I going to know that I'm supposed to use the -net raw backend or the -net socket backend? It might as well be the -net butterflies backend as far as a user is concerned. My point is that we already have -net socket,fd and any user that passes an fd into that already knows what he wants to do with it. Making it work with raw sockets is just a natural extension to this, which works on all kernels and (with separate namespaces) is reasonably secure. I fully agree that we should not introduce further network backends that would confuse users, but making the existing backends more flexible is something entirely different. Arnd
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wednesday 27 January 2010, Sridhar Samudrala wrote: On Wed, 2010-01-27 at 22:39 +0100, Arnd Bergmann wrote: On Wednesday 27 January 2010, Anthony Liguori wrote: I think -net socket,fd should just be (trivially) extended to work with raw sockets out of the box, with no support for opening it. Then you can have libvirt or some wrapper open a raw socket and a private namespace and just pass it down. That'd work. Anthony? The fundamental problem that I have with all of this is that we should not be introducing new network backends that are based around something only a developer is going to understand. If I'm a user and I want to use an external switch in VEPA mode, how in the world am I going to know that I'm supposed to use the -net raw backend or the -net socket backend? It might as well be the -net butterflies backend as far as a user is concerned. My point is that we already have -net socket,fd and any user that passes an fd into that already knows what he wants to do with it. Making it work with raw sockets is just a natural extension to this, which works on all kernels and (with separate namespaces) is reasonably secure. Didn't realize that -net socket is already there and supports TCP and UDP sockets. I will look into extending -net socket to support AF_PACKET SOCK_RAW type sockets. Actually, Jens had a patch doing this in early 2009 already but we decided to not send that one out at the time after Or had sent his version of the raw socket interface, which was a superset. Maybe Jens can post his patch again if that still applies? Arnd
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Wednesday 27 January 2010, Anthony Liguori wrote: Introducing something that is known to be problematic from a security perspective without any clear idea of what the use-case for it is is a bad idea IMHO. vepa on existing kernels is one use-case. Considering VEPA enabled hardware doesn't exist today and the standards aren't even finished being defined, I don't think it's a really strong use case ;-) The hairpin turn (the part that is required on the bridge) was implemented in the Linux bridge in 2.6.32, so that is one existing implementation you can use as a peer. The VEPA mode in macvlan only made it into 2.6.33, so using the raw socket on older kernels does not give you actual VEPA semantics. The part of the standard that is still under discussion is the management side, which is almost entirely unrelated to this question though. With Linux-2.6.33 on both sides using raw/macvlan and bridge respectively, you can have a working VEPA setup. The only thing missing is that the hypervisor will not be able to tell the bridge to automatically enable hairpin mode (you need to do that on the bridge on a per-port basis). Now, the most important use case I see for the raw socket interface in qemu is to get vhost-net and the qemu user implementation to support the same feature set. If you ask for a network setup involving a raw socket and vhost-net and the kernel can support raw sockets but for some reason fails to set up vhost-net, you should have a fallback that has the exact same semantics at a possibly significant performance loss. Arnd
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
On Monday 25 January 2010, Dor Laor wrote: x86 qemu64 x86 phenom x86 core2duo x86 kvm64 x86 qemu32 x86 coreduo x86 486 x86 pentium x86 pentium2 x86 pentium3 x86 athlon x86 n270 I think a really nice addition would be an autodetect option for those users (e.g. desktop) that know they do not want to migrate the guest to a lower-spec machine. That option IMHO should just show up as identical to the host cpu, with the exception of features that are not supported in the guest. Arnd
Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu
On Thursday 28 January 2010, Anthony Liguori wrote: normal user uses libvirt to launch custom qemu instance. libvirt passes an fd of a raw socket to qemu and puts the qemu process in a restricted network namespace. user has another program running listening on a unix domain socket and does something to the qemu process that causes it to open the domain socket and send the fd it received from libvirt via SCM_RIGHTS. I looked at the af_unix code and it seems to suggest that this is not possible, because you cannot bind to a socket that belongs to a different network namespace. I haven't tried it though, so I may have missed something. Arnd
Re: [PATCH 0/3] Provide a zero-copy method on KVM virtio-net.
On Wednesday 10 February 2010, Xin Xiaohui wrote: The idea is simple, just to pin the guest VM user space and then let the host NIC driver have the chance to directly DMA to it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface in the host side to get copyless data transfer thru the guest virtio-net frontend. We provide multiple submits and asynchronous notification to vhost-net too. This does a lot of things that I had planned for macvtap. It's great to hear that you have made this much progress. However, I'd hope that we could combine this with the macvtap driver, which would give us zero-copy transfer capability both with and without vhost, as well as (tx at least) when using multiple guests on a macvlan setup. For transmit, it should be fairly straightforward to hook up your zero-copy method and the vhost-net interface into the macvtap driver. You have simplified the receive path significantly by assuming that the entire netdev can receive into a single guest, right? I'm assuming that the idea is to allow VMDq adapters to simply show up as separate adapters and have the driver handle this in a hardware specific way. My plan for this was to instead move support for VMDq into the macvlan driver so we can transparently use VMDq on hardware where available, including zero-copy receives, but fall back to software operation on non-VMDq hardware. Arnd
Re: [PATCH 0/3] Provide a zero-copy method on KVM virtio-net.
On Thursday 11 February 2010, Xin, Xiaohui wrote: This does a lot of things that I had planned for macvtap. It's great to hear that you have made this much progress. However, I'd hope that we could combine this with the macvtap driver, which would give us zero-copy transfer capability both with and without vhost, as well as (tx at least) when using multiple guests on a macvlan setup. You mean the zero-copy can work with macvtap driver without vhost. May you give me some detailed info about your macvtap driver and the relationship between vhost and macvtap to make me have a clear picture then? macvtap provides a user interface that is largely compatible with the tun/tap driver, and can be used in place of that from qemu. Vhost-net currently interfaces with tun/tap, but not yet with macvtap, which is easy enough to add and already on my list. The underlying code is macvlan, which is a driver that virtualizes network adapters in software, giving you multiple net_device instances for a real NIC, each of them with their own MAC address. In order to do zero-copy transmit with macvtap, the idea is to add a nonblocking version of the aio_write() function that works a lot like your transmit function. For receive, the hardware does not currently know which guest is supposed to get any frame coming in from the outside. Adding zero-copy receive requires interaction with the device driver and hardware capabilities to separate traffic by inbound MAC address into separate buffers per VM. I'm assuming that the idea is to allow VMDq adapters to simply show up as separate adapters and have the driver handle this in a hardware specific way. Does the VMDq driver do so now? I don't think anyone has published a VMDq capable driver so far. I was just assuming that you were working on one. Arnd
Re: [PATCH] Inter-VM shared memory PCI device
On Monday 08 March 2010, Cam Macdonell wrote: enum ivshmem_registers { IntrMask = 0, IntrStatus = 2, Doorbell = 4, IVPosition = 6, IVLiveList = 8 }; The first two registers are the interrupt mask and status registers. Interrupts are triggered when a message is received on the guest's eventfd from another VM. Writing to the 'Doorbell' register is how synchronization messages are sent to other VMs. The IVPosition register is read-only and reports the guest's ID number. The IVLiveList register is also read-only and reports a bit vector of currently live VM IDs. The Doorbell register is 16-bits, but is treated as two 8-bit values. The upper 8-bits are used for the destination VM ID. The lower 8-bits are the value which will be written to the destination VM and what the guest status register will be set to when the interrupt is triggered in the destination guest. A value of 255 in the upper 8-bits will trigger a broadcast where the message will be sent to all other guests. This means you have at least two intercepts for each message: 1. Sender writes to doorbell 2. Receiver gets interrupted With optionally two more intercepts in order to avoid interrupting the receiver every time: 3. Receiver masks interrupt in order to process data 4. Receiver unmasks interrupt when it's done and status is no longer pending I believe you can do much better than this: combine status and mask bits, making this level triggered, and move to a bitmask of all guests. In order to send an interrupt to another guest, the sender first checks the bit for the receiver. If it's '1', no need for any intercept, the receiver will come back anyway. If it's zero, write a '1' bit, which gets OR'd into the bitmask by the host. The receiver gets interrupted at a rising edge and just leaves the bit on, until it's done processing, then turns the bit off by writing a '1' into its own location in the mask.
Arnd
Re: [PATCH] Inter-VM shared memory PCI device
On Tuesday 09 March 2010, Cam Macdonell wrote: We could make the masking in RAM, not in registers, like virtio, which would require no exits. It would then be part of the application specific protocol and out of scope of this spec. This kind of implementation would be possible now since with UIO it's up to the application whether to mask interrupts or not and what interrupts mean. We could leave the interrupt mask register for those who want that behaviour. Arnd's idea would remove the need for the Doorbell and Mask, but we will always need at least one MMIO register to send whatever interrupts we do send. You'd also have to be very careful if the notification is in RAM to avoid races between one guest triggering an interrupt and another guest clearing its interrupt mask. A totally different option that avoids this whole problem would be to separate the signalling from the shared memory, making the PCI shared memory device a trivial device with a single memory BAR, and using a higher-level concept like a virtio based serial line for the actual signalling. Arnd
Re: [PATCH] Inter-VM shared memory PCI device
On Thursday 11 March 2010, Avi Kivity wrote: A totally different option that avoids this whole problem would be to separate the signalling from the shared memory, making the PCI shared memory device a trivial device with a single memory BAR, and using a higher-level concept like a virtio based serial line for the actual signalling. That would be much slower. The current scheme allows for an ioeventfd/irqfd short circuit which allows one guest to interrupt another without involving their qemus at all. Yes, the serial line approach would be much slower, but my point was that we can do signaling over something else, which could well be something building on irqfd. Arnd
Re: [PATCH] Inter-VM shared memory PCI device
On Thursday 11 March 2010, Avi Kivity wrote: That would be much slower. The current scheme allows for an ioeventfd/irqfd short circuit which allows one guest to interrupt another without involving their qemus at all. Yes, the serial line approach would be much slower, but my point was that we can do signaling over something else, which could well be something building on irqfd. Well, we could, but it seems to make things more complicated? A card with shared memory, and another card with an interrupt interconnect? Yes, I agree that it's more complicated if you have a specific application in mind that needs one of each, and most use cases that want shared memory also need an interrupt mechanism, but it's not always the case: - You could use ext2 with -o xip on a private mapping of a shared host file in order to share the page cache. This does not need any interrupts. - If you have more than two parties sharing the segment, there are different ways to communicate, e.g. always send an interrupt to all others, or have dedicated point-to-point connections. There is also some complexity in trying to cover all possible cases in one driver. I have to say that I also really like the idea of futex over shared memory, which could potentially make this all a lot simpler. I don't know how this would best be implemented on the host though. Arnd
Re: copyless virtio net thoughts?
On Wednesday 18 February 2009, Rusty Russell wrote: 2) Direct NIC attachment This is particularly interesting with SR-IOV or other multiqueue nics, but for boutique cases or benchmarks, could be for normal NICs. So far I have some very sketched-out patches: for the attached nic dev_alloc_skb() gets an skb from the guest (which supplies them via some kind of AIO interface), and a branch in netif_receive_skb() which returned it to the guest. This bypasses all firewalling in the host though; we're basically having the guest process drive the NIC directly. If this is not passing the PCI device directly to the guest, but uses your concept, wouldn't it still be possible to use the firewalling in the host? You can always inspect the headers, drop the frame, etc without copying the whole frame at any point. When it gets to the point of actually giving the (real pf or sr-iov vf) to one guest, you really get to the point where you can't do local firewalling any more. 3) Direct interguest networking Anthony has been thinking here: vmsplice has already been mentioned. The idea of passing directly from one guest to another is an interesting one: using dma engines might be possible too. Again, host can't firewall this traffic. Simplest as a dedicated internal lan NIC, but we could theoretically do a fast-path for certain MAC addresses on a general guest NIC. Another option would be to use an SR-IOV adapter from multiple guests, with a virtual ethernet bridge in the adapter. This moves the overhead from the CPU to the bus and/or adapter, so it may or may not be a real benefit depending on the workload. Arnd
Re: copyless virtio net thoughts?
On Thursday 19 February 2009, Rusty Russell wrote: Not quite: I think PCI passthrough IMHO is the *wrong* way to do it: it makes migrate complicated (if not impossible), and requires emulation or the same NIC on the destination host. This would be the *host* seeing the virtual functions as multiple NICs, then the ability to attach a given NIC directly to a process. I guess what you mean then is what Intel calls VMDq, not SR-IOV. Eddie has some slides about this at http://docs.huihoo.com/kvm/kvmforum2008/kdf2008_7.pdf . The latest network cards support both operation modes, and it appears to me that there is a place for both. VMDq gives you the best performance without limiting flexibility, while SR-IOV performance in theory can be even better, but sacrificing a lot of flexibility and potentially local (guest-to-guest) performance. AFAICT, any card that supports SR-IOV should also allow a VMDq like model, as you describe. Arnd
Re: [PATCH] remove static declaration from wall clock version
On Thursday 26 February 2009, Glauber Costa wrote:

@@ -548,15 +548,13 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
 static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
 {
-	static int version;
+	int version = 1;
 	struct pvclock_wall_clock wc;
 	struct timespec now, sys, boot;
 
 	if (!wall_clock)
 		return;
 
-	version++;
-	kvm_write_guest(kvm, wall_clock, &version, sizeof(version));
 	/*

Doesn't this mean that kvm_write_guest now writes an uninitialized value to the guest? I think what you need here is a 'static atomic_t version;' so you can do an atomic_inc instead of the ++. Arnd
Re: [PATCH] remove static declaration from wall clock version
On Friday 27 February 2009, Glauber Costa wrote: Doesn't this mean that kvm_write_guest now writes an uninitialized value to the guest? No. If you look closely, it's now initialized to 1. Right, I didn't see that change at first. Arnd
Re: OT: Intel-Matrix for VT-capability?
On Friday 24 April 2009, Oliver Rath wrote: Hi, I'm looking for a way to see the VT capability of Intel processors by their name :-/ I.e. T7200 has VT, T3400 has not. Is there a rule in the naming scheme for seeing VT capability? Alternatively, does a matrix exist anywhere on the net for this? I'm tired of searching for VT capability for every new OEM Intel processor. On the Intel site a PDF table _did_ exist (not all processors, but most of the T-series), but it seems to be removed. http://en.wikipedia.org/wiki/List_of_Intel_Core_2_microprocessors is quite good here. As a rule of thumb, anything higher than 6000 will have VT, anything below 6000 will not. Interesting exceptions are Doesn't have VT: E7300, Q8200, Q8400, E8190 Does have VT: T5600, U2xxx, SU3xxx, Celeron 900 (?) May have VT[1]: T5500, Q8300, E7400, E7500, E5300, E5400 Interestingly, when you look at the price list, you will see that *all* processors that are not being obviously phased out (i.e. have the same or higher price as a superior model) and carry a Pentium or Core 2 name come with VT enabled. I think it's very unlikely that they will come out with anything new that does not run KVM. Arnd [1] http://www.heise.de/newsticker/Auch-billigere-Intel-Prozessoren-bald-mit-Virtualisierungsbefehlen--/meldung/136306 [2] http://www.intc.com/priceList.cfm
Re: OT: Intel-Matrix for VT-capability?
On Tuesday 28 April 2009, Marc Bevand wrote: As a rule of thumb, anything higher than 6000 will have VT, anything below 6000 will not. Interesting exceptions are Doesn't have VT: E7300, Q8200, Q8400, E8190 Does have VT: T5600, U2xxx, SU3xxx, Celeron 900 (?) May have VT[1]: T5500, Q8300, E7400, E7500, E5300, E5400 Interestingly, when you look at the price list, you will see that *all* processors that are not being obviously phased out (i.e. have the same or higher price as a superior model) and carry a Pentium or Core 2 name come with VT enabled. This is very wrong: - none of the Pentium, Celeron, Atom processors, even the latest ones, come with VT All Pentium and Celeron processors have numbers below 6000, so they fit in the rule of thumb I gave above. The Celeron 900 is listed as having VT on the processorfinder, but that also incorrectly lists it as having two cores, so who knows? - none of the Core 2 Duo E7xxx and Core 2 Quad Q8xxx support VT Be very careful into what you buy, check processorfinder.intel.com. Interestingly I found out that Intel will enable VT on a very small number of Core 2 and Pentium processors on June 12: http://www.tcmagazine.com/comments.php?shownews=25886 These are the ones I listed above as 'may have VT': Q8300, E7400, E7500, E5300, E5400. The link I gave was to another article mentioning these exact numbers. The E7300, Q8200 and Q8400 I listed as 'Doesn't have VT' are the remaining E7xxx and Q8xxx processors, which appear to be phased out (not sure about Q8400, which was announced at the same time as this news). Arnd
Re: [KVM timekeeping 30/35] IOCTL for setting TSC rate
On Friday 20 August 2010 19:56:20 Glauber Costa wrote:

@@ -675,6 +676,9 @@ struct kvm_clock_data {
 #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2)
 /* Available with KVM_CAP_PPC_GET_PVINFO */
 #define KVM_PPC_GET_PVINFO _IOW(KVMIO, 0xa1, struct kvm_ppc_pvinfo)
+/* Available with KVM_CAP_SET_TSC_RATE */
+#define KVM_X86_GET_TSC_RATE _IOR(KVMIO, 0xa2, __u32)
+#define KVM_X86_SET_TSC_RATE _IOW(KVMIO, 0xa3, __u32)

wrap this into a struct? I don't think that would improve the code. Generally, we try to *avoid* using structs in ioctl arguments, although KVM does have a precedent of using structs there. In fact, the code here could be simplified by using get_user/put_user on the simple argument, which would not be possible with a struct. Arnd
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
On Wednesday 08 September 2010, Krishna Kumar2 wrote: The new guest and qemu code work with old vhost-net, just with reduced performance, yes? Yes, I have tested new guest/qemu with old vhost but using #numtxqs=1 (or not passing any arguments at all to qemu to enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost, since vhost_net_set_backend in the unmodified vhost checks for boundary overflow. I have also tested running an unmodified guest with new vhost/qemu, but qemu should not specify numtxqs > 1. Can you live migrate a new guest from new-qemu/new-kernel to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel? If not, do we need to support all those cases? Arnd
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
On Wednesday 08 September 2010, Krishna Kumar2 wrote: On Wednesday 08 September 2010, Krishna Kumar2 wrote: The new guest and qemu code work with old vhost-net, just with reduced performance, yes? Yes, I have tested new guest/qemu with old vhost but using #numtxqs=1 (or not passing any arguments at all to qemu to enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost, since vhost_net_set_backend in the unmodified vhost checks for boundary overflow. I have also tested running an unmodified guest with new vhost/qemu, but qemu should not specify numtxqs > 1. Can you live migrate a new guest from new-qemu/new-kernel to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel? If not, do we need to support all those cases? I have not tried this, though I added some minimal code in virtio_net_load and virtio_net_save. I don't know what needs to be done exactly at this time. I forgot to put this in the Next steps list of things to do. I was mostly trying to find out if you think it should work or if there are specific reasons why it would not. E.g. when migrating to a machine that has an old qemu, the guest gets reduced to a single queue, but it's not clear to me how it can learn about this, or if it can get hidden by the outbound qemu. Arnd
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
On Tuesday 14 September 2010, Shirley Ma wrote: On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote: That's what io_submit() is for. Then io_getevents() tells you what a while actually was. This macvtap zero copy uses iov buffers from vhost ring, which is allocated from guest kernel. In host kernel, vhost calls macvtap sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers' pages for zero copy. The patch is relying on how vhost handle these buffers. I need to look at vhost code (qemu) first for addressing the questions here. I guess the best solution would be to make macvtap_aio_write return -EIOCBQUEUED when a packet gets passed down to the adapter, and call aio_complete when the adapter is done with it. This would change the regular behavior of macvtap into a model where every write on the file blocks until the packet has left the machine, which gives us better flow control, but does slow down the traffic when we only put one packet at a time into the queue. It also allows the user to call io_submit instead of write in order to do an asynchronous submission as Avi was suggesting. Arnd
Re: [PATCH v11 13/17] Add mp(mediate passthru) device.
On Tuesday 28 September 2010, Michael S. Tsirkin wrote:

+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, len);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+		kfree_skb(skb);
+		return -EAGAIN;
+	}
+
+	skb->protocol = eth_type_trans(skb, mp->dev);

Why are you calling eth_type_trans() on transmit? So that GSO can work. BTW macvtap does:

	skb_set_network_header(skb, ETH_HLEN);
	skb_reset_mac_header(skb);
	skb->protocol = eth_hdr(skb)->h_proto;

and I think this is broken for vlans. Arnd? Hmm, that code (besides set_network_header) was added by Sridhar for GSO support. I believe I originally did eth_type_trans but had to change it before that time because it broke something. Unfortunately, my memory on that is not very good any more. Can you be more specific what the problem is? Do you think it breaks when a guest sends VLAN tagged frames or when macvtap is connected to a VLAN interface that adds another tag (or only the combination)? Arnd
Re: [PATCH v11 13/17] Add mp(mediate passthru) device.
On Tuesday 28 September 2010, Michael S. Tsirkin wrote: On Tue, Sep 28, 2010 at 04:39:59PM +0200, Arnd Bergmann wrote: Can you be more specific what the problem is? Do you think it breaks when a guest sends VLAN tagged frames or when macvtap is connected to a VLAN interface that adds another tag (or only the combination)? I expect the protocol value to be wrong when guest sends vlan tagged frames as 802.1q frames have a different format. Ok, I see. Would that be fixed by using eth_type_trans()? I don't see any code in there that tries to deal with the VLAN tag, so do we have the same problem in the tun/tap driver? Also, I wonder how we handle the case where both the guest and the host do VLAN tagging. Does the host transparently override the guest tag, or does it add a nested tag? More importantly, what should it do? Arnd
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
On Tuesday 05 October 2010, Krishna Kumar2 wrote: After testing various combinations of #txqs, #vhosts, #netperf sessions, I think the drop for 1 stream is due to TX and RX for a flow being processed on different cpus. I did two more tests: 1. Pin vhosts to same CPU: - BW drop is much lower for 1 stream case (- 5 to -8% range) - But performance is not so high for more sessions. 2. Changed vhost to be single threaded: - No degradation for 1 session, and improvement for upto 8, sometimes 16 streams (5-12%). - BW degrades after that, all the way till 128 netperf sessions. - But overall CPU utilization improves. Summary of the entire run (for 1-128 sessions): txq=4: BW: (-2.3) CPU: (-16.5) RCPU: (-5.3) txq=16: BW: (-1.9) CPU: (-24.9) RCPU: (-9.6) I don't see any reasons mentioned above. However, for higher number of netperf sessions, I see a big increase in retransmissions:

	#netperf    ORG            NEW
	            BW (#retr)     BW (#retr)
	   1        70244 (0)      64102 (0)
	   4        21421 (0)      36570 (416)
	   8        21746 (0)      38604 (148)
	  16        21783 (0)      40632 (464)
	  32        22677 (0)      37163 (1053)
	  64        23648 (4)      36449 (2197)
	 128        23251 (2)      31676 (3185)

This smells like it could be related to a problem that Ben Greear found recently (see macvlan: Enable qdisc backoff logic). When the hardware is busy, we used to just drop the packet. With Ben's patch, we return -EAGAIN to qemu (or vhost-net) to trigger a resend. I suppose what we really should do is feed that condition back to the guest network stack and implement the backoff in there. Arnd
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
On Wednesday 06 October 2010 19:14:42 Krishna Kumar2 wrote: Arnd Bergmann a...@arndb.de wrote on 10/06/2010 05:49:00 PM: I don't see any reasons mentioned above. However, for higher number of netperf sessions, I see a big increase in retransmissions:

	#netperf    ORG            NEW
	            BW (#retr)     BW (#retr)
	   1        70244 (0)      64102 (0)
	   4        21421 (0)      36570 (416)
	   8        21746 (0)      38604 (148)
	  16        21783 (0)      40632 (464)
	  32        22677 (0)      37163 (1053)
	  64        23648 (4)      36449 (2197)
	 128        23251 (2)      31676 (3185)

This smells like it could be related to a problem that Ben Greear found recently (see macvlan: Enable qdisc backoff logic). When the hardware is busy, we used to just drop the packet. With Ben's patch, we return -EAGAIN to qemu (or vhost-net) to trigger a resend. I suppose what we really should do is feed that condition back to the guest network stack and implement the backoff in there. Thanks for the pointer. I will take a look at this as I hadn't seen this patch earlier. Is there any way to figure out if this is the issue? I think a good indication would be if this changes with/without the patch, and if you see -EAGAIN in qemu with the patch applied. Arnd
Re: [Qemu-devel] Re: [PATCH] pc: e820 qemu_cfg tables need to be packed
On Thursday 14 October 2010 21:58:08 Alex Williamson wrote: If it works anywhere (I assume it works on 32bit), then it's only because it happened to get the alignment right. This just makes 64bit hosts get it right too. I don't see any compatibility issues, non-packed + 64bit = broken. Thanks, I would actually assume that only x86-32 hosts got it right, because all 32 bit hosts I've seen other than x86 also define 8 byte alignment for uint64_t. You might however consider making it __attribute((__packed__, __aligned__(4))) instead of just packed, because otherwise you make the alignment one byte, which is not only different from what it used to be on x86-32 but also will cause inefficient compiler output on platforms that don't have unaligned word accesses in hardware. Arnd
Re: [Qemu-devel] Re: [PATCH] pc: e820 qemu_cfg tables need to be packed
On Thursday 14 October 2010 22:59:04 Alex Williamson wrote:
> The structs in question only contain 4 8-byte elements, so there
> shouldn't be any change on x86-32 using one-byte aligned packing.

I'm talking about the alignment of the structure, not the members
within the structure. The data structure should be compatible, but not
accesses to it.

> AFAIK, e820 is x86-only, so we don't need to worry about breaking
> anyone else.

You can use qemu to emulate an x86 pc on anything...

> Performance isn't much of a consideration for this type of interface
> since it's only used pre-boot. In fact, the channel between qemu and
> the bios is only one byte wide, so wider alignment can cost extra
> emulated I/O accesses.

Right, the data gets passed as bytes, so it hardly matters in the end.
Still, e820_add_entry assigns data to the struct members, which it
either does using byte accesses and shifts or multiple 32 bit
assignments.

Just because using a one byte alignment technically results in correct
output doesn't make it the right solution. I don't care about the few
cycles of execution time or the few bytes you waste in this particular
case, but you are setting a wrong example by using smaller alignment
than necessary.

	Arnd
Re: [PATCH v2] pc: e820 qemu_cfg tables need to be packed
On Friday 15 October 2010, Alex Williamson wrote:
> We can't let the compiler define the alignment for qemu_cfg data.
>
> Signed-off-by: Alex Williamson alex.william...@redhat.com
> ---
> v2: Adjust alignment to help non-x86 hosts per Arnd's suggestion

Ok, looks good now. Thanks!

	Arnd
Re: TODO item: guest programmable mac/vlan filtering with macvtap
On Friday 15 October 2010, Michael S. Tsirkin wrote:
> On Thu, Oct 14, 2010 at 11:40:52PM +0200, Dragos Tatulea wrote:
> > Hi,
> >
> > I'm starting a thread related to the TODO item mentioned in the
> > subject. Currently still gathering info and trying to make kvm and
> > macvtap play nicely together. I have used this [1] guide to set it
> > up but qemu is still complaining about the PCI device address of the
> > virtio-net-pci. Tried with latest qemu. Am I missing something here?
> >
> > [1] - http://virt.kernelnewbies.org/MacVTap
>
> It really should be:
>
> 	-net nic,model=virtio,netdev=foo -netdev tap,id=foo
>
> Created account but still could not edit the wiki. Arnd, know why
> that is? Could you correct qemu command line pls?

I also have lost write access to the wiki, no idea what happened
there. I started the page, but it subsequently became protected.

We never added support for the qemu command line directly, the plan
was to do that using helper scripts. The only way to do it is to
redirect both input and output to the tap device, so you need to do

	-net nic,model=virtio,netdev=foo -netdev tap,id=foo,fd=3 3<>/dev/tapN

when starting from bash.

	Arnd
Re: [PATCH v2 02/22] bitops: rename generic little-endian bitops functions
On Thursday 21 October 2010, Akinobu Mita wrote:
> As a preparation for providing little-endian bitops for all
> architectures, this removes the generic_ prefix from the little-endian
> bitops function names in asm-generic/bitops/le.h.
>
> s/generic_find_next_le_bit/find_next_le_bit/
> s/generic_find_next_zero_le_bit/find_next_zero_le_bit/
> s/generic_find_first_zero_le_bit/find_first_zero_le_bit/
> s/generic___test_and_set_le_bit/__test_and_set_le_bit/
> s/generic___test_and_clear_le_bit/__test_and_clear_le_bit/
> s/generic_test_le_bit/test_le_bit/
> s/generic___set_le_bit/__set_le_bit/
> s/generic___clear_le_bit/__clear_le_bit/
> s/generic_test_and_set_le_bit/test_and_set_le_bit/
> s/generic_test_and_clear_le_bit/test_and_clear_le_bit/
>
> Signed-off-by: Akinobu Mita akinobu.m...@gmail.com
> Cc: Hans-Christian Egtvedt hans-christian.egtv...@atmel.com
> Cc: Geert Uytterhoeven ge...@linux-m68k.org
> Cc: Roman Zippel zip...@linux-m68k.org
> Cc: Andreas Schwab sch...@linux-m68k.org
> Cc: linux-m...@lists.linux-m68k.org
> Cc: Greg Ungerer g...@uclinux.org
> Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
> Cc: Paul Mackerras pau...@samba.org
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: Andy Grover andy.gro...@oracle.com
> Cc: rds-de...@oss.oracle.com
> Cc: David S. Miller da...@davemloft.net
> Cc: net...@vger.kernel.org
> Cc: Avi Kivity a...@redhat.com
> Cc: Marcelo Tosatti mtosa...@redhat.com
> Cc: kvm@vger.kernel.org

Acked-by: Arnd Bergmann a...@arndb.de
Re: [PATCH iproute2] Add passthru mode and support 'mode' parameter with macvtap devices
On Wednesday 27 October 2010, Sridhar Samudrala wrote:
> Support a new 'passthru' mode with macvlan and 'mode' parameter with
> macvtap devices.
>
> Signed-off-by: Sridhar Samudrala s...@us.ibm.com

Can you split this into two patches? We definitely want the part adding
support for macvtap device mode setting now. The new passthru mode for
macvlan and macvtap probably needs some discussion and the patch in
iproute2 will depend on the kernel patch getting merged first.

I've added Stephen to the Cc list, he should also take a look.

	Arnd

> diff --git a/include/linux/if_link.h b/include/linux/if_link.h
> index f5bb2dc..23de79e 100644
> --- a/include/linux/if_link.h
> +++ b/include/linux/if_link.h
> @@ -230,6 +230,7 @@ enum macvlan_mode {
>  	MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
>  	MACVLAN_MODE_VEPA    = 2, /* talk to other ports through ext bridge */
>  	MACVLAN_MODE_BRIDGE  = 4, /* talk to bridge ports directly */
> +	MACVLAN_MODE_PASSTHRU = 8, /* take over the underlying device */
>  };
>
>  /* SR-IOV virtual function management section */
> diff --git a/ip/Makefile b/ip/Makefile
> index 2f223ca..6054e8a 100644
> --- a/ip/Makefile
> +++ b/ip/Makefile
> @@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
>      ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
>      ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
>      iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
> -    iplink_macvlan.o
> +    iplink_macvlan.o iplink_macvtap.o
>
>  RTMONOBJ=rtmon.o
> diff --git a/ip/iplink_macvlan.c b/ip/iplink_macvlan.c
> index a3c78bd..97787f9 100644
> --- a/ip/iplink_macvlan.c
> +++ b/ip/iplink_macvlan.c
> @@ -48,6 +48,8 @@ static int macvlan_parse_opt(struct link_util *lu, int argc, char **argv,
>  			mode = MACVLAN_MODE_VEPA;
>  		else if (strcmp(*argv, "bridge") == 0)
>  			mode = MACVLAN_MODE_BRIDGE;
> +		else if (strcmp(*argv, "passthru") == 0)
> +			mode = MACVLAN_MODE_PASSTHRU;
>  		else
>  			return mode_arg();
> @@ -82,6 +84,7 @@ static void macvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]
>  		    mode == MACVLAN_MODE_PRIVATE ? "private"
>  		    : mode == MACVLAN_MODE_VEPA ? "vepa"
>  		    : mode == MACVLAN_MODE_BRIDGE ? "bridge"
> +		    : mode == MACVLAN_MODE_PASSTHRU ? "passthru"
>  		    : "unknown");
>  }
> diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c
> new file mode 100644
> index 000..040cc68
> --- /dev/null
> +++ b/ip/iplink_macvtap.c
> @@ -0,0 +1,93 @@
> +/*
> + * iplink_macvtap.c	macvtap device support
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/socket.h>
> +#include <linux/if_link.h>
> +
> +#include "rt_names.h"
> +#include "utils.h"
> +#include "ip_common.h"
> +
> +static void explain(void)
> +{
> +	fprintf(stderr,
> +		"Usage: ... macvtap mode { private | vepa | bridge | passthru }\n"
> +	);
> +}
> +
> +static int mode_arg(void)
> +{
> +	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
> +		"\"vepa\" or \"bridge\" \"passthru\"\n");
> +	return -1;
> +}
> +
> +static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv,
> +			     struct nlmsghdr *n)
> +{
> +	while (argc > 0) {
> +		if (matches(*argv, "mode") == 0) {
> +			__u32 mode = 0;
> +			NEXT_ARG();
> +
> +			if (strcmp(*argv, "private") == 0)
> +				mode = MACVLAN_MODE_PRIVATE;
> +			else if (strcmp(*argv, "vepa") == 0)
> +				mode = MACVLAN_MODE_VEPA;
> +			else if (strcmp(*argv, "bridge") == 0)
> +				mode = MACVLAN_MODE_BRIDGE;
> +			else if (strcmp(*argv, "passthru") == 0)
> +				mode = MACVLAN_MODE_PASSTHRU;
> +			else
> +				return mode_arg();
> +
> +			addattr32(n, 1024, IFLA_MACVLAN_MODE, mode);
> +		} else if (matches(*argv, "help") == 0) {
> +			explain();
> +			return -1;
> +		} else {
> +			fprintf(stderr, "macvtap: what is \"%s\"?\n", *argv);
> +			explain();
> +			return -1;
> +		}
> +		argc--, argv++;
> +	}
> +
> +	return 0;
> +}
> +
> +static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
> +{
> +	__u32 mode;
> +
> +	if
Re: [RFC PATCH] macvlan: Introduce a PASSTHRU mode to takeover the underlying device
On Wednesday 27 October 2010, Sridhar Samudrala wrote:
> With the current default macvtap mode, a KVM guest using virtio with
> macvtap backend has the following limitations.
> - cannot change/add a mac address on the guest virtio-net
> - cannot create a vlan device on the guest virtio-net
> - cannot enable promiscuous mode on guest virtio-net
>
> This patch introduces a new mode called 'passthru' when creating a
> macvlan device which allows takeover of the underlying device and
> passing it to a guest using virtio with macvtap backend. Only one
> macvlan device is allowed in passthru mode and it inherits the mac
> address from the underlying device and sets it in promiscuous mode to
> receive and forward all the packets.

Interesting approach. It somewhat stretches the definition of the
macvlan concept, but it does sound useful to have.

I was thinking about adding a new tap frontend driver that could share
some code with macvtap and do only the takeover but not use macvlan as
a base. I believe that would be a cleaner abstraction, but your code
has two advantages in that the implementation is much simpler and that
it can share a fair amount of the infrastructure that we're putting
into qemu/libvirt/etc.

	Arnd

PS: Please add a Signed-off-by: line when sending a patch, even for
discussion.
Re: [PATCH iproute2] Support 'mode' parameter when creating macvtap device
On Friday 29 October 2010, Sridhar Samudrala wrote:
> Add support for 'mode' parameter when creating a macvtap device. This
> allows a macvtap device to be created in bridge, private or the
> default vepa modes.
>
> Signed-off-by: Sridhar Samudrala s...@us.ibm.com

Acked-by: Arnd Bergmann a...@arndb.de
Re: [PATCH] macvlan: Introduce 'passthru' mode to takeover the underlying device
On Friday 29 October 2010, Sridhar Samudrala wrote:
> With the current default 'vepa' mode, a KVM guest using virtio with
> macvtap backend has the following limitations.
> - cannot change/add a mac address on the guest virtio-net

I believe this could be changed if there is a need, but I actually
consider it one of the design points of macvlan that the guest is not
able to change the mac address. With 802.1Qbg you rely on the switch
being able to identify the guest by its MAC address, which the host
kernel must ensure.

> - cannot create a vlan device on the guest virtio-net

Why not? If this doesn't work, it's probably a bug! Why does the
passthru mode enable it if it doesn't work already?

> - cannot enable promiscuous mode on guest virtio-net

Could you elaborate why such a setup would be useful?

	Arnd
Re: [PATCH 07/10] UAPI: Put a comment into uapi/asm-generic/kvm_para.h and use it from arches
On Wednesday 17 October 2012, David Howells wrote:
> Make uapi/asm-generic/kvm_para.h non-empty by addition of a comment to
> stop the patch program from deleting it when it creates it. Then
> delete empty arch-specific uapi/asm/kvm_para.h files and tell the
> Kbuild files to use the generic instead.
>
> Should this perhaps instead be a #warning or #error that the facility
> is unsupported on this arch?

Just an empty file is fine by me, but an #error also sounds reasonable
if we want users to be able to write autoconf tests for it.

> Signed-off-by: David Howells dhowe...@redhat.com
> cc: Arnd Bergmann a...@arndb.de
> cc: Avi Kivity a...@redhat.com
> cc: Marcelo Tosatti mtosa...@redhat.com
> cc: kvm@vger.kernel.org

Acked-by: Arnd Bergmann a...@arndb.de
Re: [PATCH resend] compat_ioctl: fix warning caused by qemu
On Friday 01 July 2011, Johannes Stezenbach wrote:
> On Linux x86_64 host with 32bit userspace, running qemu or even just
>
> 	qemu-img create -f qcow2 some.img 1G
>
> causes a kernel warning:
>
> ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(5326){t:'S';sz:0} arg(7fff) on some.img
> ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img
>
> ioctl 5326 is CDROM_DRIVE_STATUS, ioctl 801c0204 is FDGETPRM.
>
> The warning appears because the Linux compat-ioctl handler for these
> ioctls only applies to block devices, while qemu also uses the ioctls
> on plain files.
>
> Signed-off-by: Johannes Stezenbach j...@sig21.net

Acked-by: Arnd Bergmann a...@arndb.de

> ---
> (resend with Cc: suggested by get_maintainer.pl)
> discussed in http://lkml.kernel.org/r/20110617090424.ga19...@sig21.net
>
> Arnd, is this what you had in mind, or did you mean to move all floppy
> compat definitions? I decided to go with the minimal change.
> Tested on both 2.6.39.2 and 3.0-rc5-63-g0d72c6f.

Yes, that should be fine, unless Jens would like to see a different
solution for the struct definitions, e.g. moving all of the floppy
compat ioctl numbers to fd.h. I'm fine with it either way.

	Arnd
Re: [KVM PATCH] KVM: introduce xinterface API for external interaction with guests
On Thursday 16 July 2009, Gregory Haskins wrote:
> Background: The original vbus code was tightly integrated with kvm.ko.
> Avi suggested that we abstract the interfaces such that it could live
> outside of kvm.

The code is still highly kvm-specific, you would not be able to use it
with another hypervisor like lguest or vmware player, right?

> Example usage: QEMU instantiates a guest, and an external module foo
> that desires the ability to interface with the guest (say via
> open(/dev/foo)). QEMU may then issue a KVM_GET_VMID operation to
> acquire the u64-based vmid, and pass it to
> ioctl(foofd, FOO_SET_VMID, vmid). Upon receipt, the foo module can
> issue kvm_xinterface_find(vmid) to acquire the proper context.
> Internally, the struct kvm* and associated struct module* will remain
> pinned at least until the foo module calls kvm_xinterface_put().

Your approach allows passing the vmid from a process that does not own
the kvm context. This looks like an intentional feature, but I can't
see what this gains us.

> As a final measure, we link the xinterface code statically into the
> kernel so that callers are guaranteed a stable interface to
> kvm_xinterface_find() without implicitly pinning kvm.ko or racing
> against it.

I also don't understand this. Are you worried about driver modules
breaking when an externally-compiled kvm.ko is loaded? The same could
be achieved by defining your data structures kvm_xinterface_ops and
kvm_xinterface in a kernel header that is not shipped by kvm-kmod but
always taken from the kernel headers. It does not matter if the entry
points are built into the kernel or exported from a kvm.ko as long as
you define a fixed ABI.

What is the problem with pinning kvm.ko from another module using its
features? Can't you simply provide a function call to lookup the kvm
context pointer from the file descriptor to achieve the same
functionality?

To take that thought further, maybe the dependency can be turned
around: If every user (pci-uio, virtio-net, ...) exposes a file
descriptor based interface to user space, you can have a kvm ioctl to
register the object behind that file descriptor with an existing kvm
context to associate it with a guest. That would nicely solve the life
time questions by pinning the external object for the life time of the
kvm context rather than the other way round, and it would be
completely separate from kvm in that each such object could be used by
other subsystems independent of kvm.

	Arnd
Re: [KVM PATCH] KVM: introduce xinterface API for external interaction with guests
On Thursday 16 July 2009, Gregory Haskins wrote:
> Arnd Bergmann wrote:
> > Your approach allows passing the vmid from a process that does not
> > own the kvm context. This looks like an intentional feature, but I
> > can't see what this gains us.
>
> This work is towards the implementation of lockless-shared-memory
> subsystems, which includes ring constructs such as virtio-ring,
> VJ-netchannels, and vbus-ioq. I find that these designs perform
> optimally when you allow two distinct contexts (producer + consumer)
> to process the ring concurrently, which implies a disparate context
> from the guest in question. Note that the infrastructure we are
> discussing does not impose a requirement for the contexts to be
> unique: it will work equally well from the same or a different
> process.
>
> For an example of this producer/consumer dynamic over shared memory
> in action, please refer to my previous posting re: vbus
> http://lkml.org/lkml/2009/4/21/408
>
> I am working on v4 now, and this patch is part of the required
> support.

Ok. I can see how your approach gives you more flexibility in this
regard, but it does not seem critical.

> But to your point, I suppose the dependency lifetime thing is not a
> huge deal. I could therefore modify the patch to simply link
> xinterface.o into kvm.ko and still achieve the primary objective by
> retaining ops->owner.

Right. And even if it's a separate module, holding an extra reference
on kvm.ko will not cause any harm.

> > Can't you simply provide a function call to lookup the kvm context
> > pointer from the file descriptor to achieve the same functionality?
>
> You mean so have:
>
> 	struct kvm_xinterface *kvm_xinterface_find(int fd)
>
> (instead of creating our own vmid namespace)? Or are you suggesting
> using fget() instead of kvm_xinterface_find()?

I guess they are roughly equivalent. Either you pass a fd to
kvm_xinterface_find, or pass the struct file pointer you get from
fget. The latter is probably more convenient because it allows you to
pass around the struct file in kernel contexts that don't have that
file descriptor open.

> > To take that thought further, maybe the dependency can be turned
> > around: If every user (pci-uio, virtio-net, ...) exposes a file
> > descriptor based interface to user space, you can have a kvm ioctl
> > to register the object behind that file descriptor with an existing
> > kvm context to associate it with a guest.
>
> FWIW: We do that already for the signaling path (see irqfd and
> ioeventfd in kvm.git). Each side exposes interfaces that accept
> eventfds, and the fds are passed around that way. However, for the
> functions we are talking about now, I don't think it really works
> well to go the other way. I could be misunderstanding what you mean,
> though.
>
> What I mean is that it's KVM that is providing a service to the other
> modules (in this case, translating memory pointers), so what would an
> inverse interface look like for that? And even if you came up with
> one, it seems to me that it's just 6 of one, half-dozen of the other
> kind of thing.

I mean something like

int kvm_ioctl_register_service(struct file *filp, unsigned long arg)
{
	struct file *service = fget(arg);
	struct kvm *kvm = filp->private_data;

	if (!service->f_ops->new_xinterface_register)
		return -EINVAL;

	return service->f_ops->new_xinterface_register(service, (void *)kvm,
						&kvm_xinterface_ops);
}

This would assume that we define a new file_operation specifically for
this, which would simplify the code, but there are other ways to
achieve the same. It would even mean that you don't need any static
code as an interface layer.

	Arnd
Re: [KVM_AUTOTEST] set English environment
On Thursday 09 July 2009, Lukáš Doktor wrote:
> --- orig/client/tests/kvm/control	2009-07-08 13:18:07.0 +0200
> +++ new/client/tests/kvm/control	2009-07-09 12:32:32.0 +0200
> @@ -45,6 +45,8 @@ Each test is appropriately documented on
>
>  import sys, os
>
> +# set English environment
> +os.environ['LANG'] = 'en_US.UTF-8'
> +
>  # enable modules import from current directory (tests/kvm)
>  pwd = os.path.join(os.environ['AUTODIR'],'tests/kvm')
>  sys.path.append(pwd)

LANG can still be overridden with LC_ALL. For a well-defined
environment, best set LC_ALL='C'. This will also set other i18n
settings and works on systems that don't come with UTF-8 enabled.

	Arnd
Re: [PATCH 0/7] AlacrityVM guest drivers
On Thursday 06 August 2009, Gregory Haskins wrote:
> We can exchange out the virtio-pci module like this:
>
> (guest-side)
> |--------------------------
> | virtio-net
> |--------------------------
> | virtio-ring
> |--------------------------
> | virtio-bus
> |--------------------------
> | virtio-vbus
> |--------------------------
> | vbus-proxy
> |--------------------------
> | vbus-connector
> |--------------------------
>             |
>          (vbus)
>             |
> |--------------------------
> | kvm.ko
> |--------------------------
> | vbus-connector
> |--------------------------
> | vbus
> |--------------------------
> | virtio-net-tap (vbus model)
> |--------------------------
> | netif
> |--------------------------
> (host-side)
>
> So virtio-net runs unmodified. What is competing here is virtio-pci
> vs. virtio-vbus. Also, venet vs. virtio-net are technically
> competing. But to say "virtio vs. vbus" is inaccurate, IMO.

I think what's confusing everyone is that you are competing on
multiple issues:

1. Implementation of bus probing: both vbus and virtio are backed by
   PCI devices and can be backed by something else (e.g. virtio by
   lguest or even by vbus).

2. Exchange of metadata: virtio uses a config space, vbus uses devcall
   to do the same.

3. User data transport: virtio has virtqueues, vbus has shm/ioq.

I think these three are the main differences, and the venet vs.
virtio-net question comes down to which interface the drivers use for
each aspect. Do you agree with this interpretation?

Now to draw conclusions from each of these is of course highly
subjective, but this is how I view it:

1. The bus probing is roughly equivalent, they both work and the
   virtio method seems to need a little less code but that could be
   fixed by slimming down the vbus code as I mentioned in my comments
   on the pci-to-vbus bridge code. However, I would much prefer not to
   have both of them, and virtio came first.

2. The two methods (devcall/config space) are more or less equivalent
   and you should be able to implement each one through the other one.
   The virtio design was driven by making it look similar to PCI, the
   vbus design was driven by making it easy to implement in a host
   kernel. I don't care too much about these, as they can probably
   coexist without causing any trouble. For a (hypothetical)
   vbus-in-virtio device, a devcall can be a config-set/config-get
   pair, for a virtio-in-vbus, you can do a config-get and a
   config-set devcall and be happy. Each could be done in a trivial
   helper library.

3. The ioq method seems to be the real core of your work that makes
   venet perform better than virtio-net with its virtqueues. I don't
   see any reason to doubt that your claim is correct. My conclusion
   from this would be to add support for ioq to virtio devices,
   alongside virtqueues, but to leave out the extra bus_type and
   probing method.

	Arnd
Re: [PATCH 1/1] net: fix vnet_hdr bustage with slirp
On Friday 07 August 2009, Mark McLoughlin wrote:
> slirp has started using VLANClientState::opaque and this has caused
> the kvm specific tap_has_vnet_hdr() hack to break because we blindly
> use this opaque pointer even if it is not a tap client.
>
> Add yet another hack to check that we're actually getting called with
> a tap client.
>
> [Needed on stable-0.11 too]
>
> Signed-off-by: Mark McLoughlin mar...@redhat.com

Jens and I discovered the same bug before, but then we forgot about
sending a fix (sorry). Your patch should work fine as a workaround,
but I wonder if it is the right solution.

The abstraction of struct VLANClientState is otherwise done through
function pointers taking the VLANClientState pointer as their first
argument. IMHO a cleaner abstraction would be to do the same for
tap_has_vnet_hdr(), like the patch below, and similar for other
functions passing 'opaque' pointers.

Signed-off-by: Arnd Bergmann a...@arndb.de

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 6dfe758..6b34e82 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -123,7 +123,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
     VirtIONet *n = to_virtio_net(vdev);
     VLANClientState *host = n->vc->vlan->first_client;
 
-    if (tap_has_vnet_hdr(host)) {
+    if (host->has_vnet_hdr && host->has_vnet_hdr(host)) {
         tap_using_vnet_hdr(host, 1);
         features |= (1 << VIRTIO_NET_F_CSUM);
         features |= (1 << VIRTIO_NET_F_GUEST_CSUM);
@@ -166,7 +166,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
     n->mergeable_rx_bufs = !!(features & (1 << VIRTIO_NET_F_MRG_RXBUF));
 
 #ifdef TAP_VNET_HDR
-    if (!tap_has_vnet_hdr(host) || !host->set_offload)
+    if (!(host->has_vnet_hdr && host->has_vnet_hdr(host)) || !host->set_offload)
         return;
 
     host->set_offload(host,
@@ -398,7 +398,7 @@ static int receive_header(VirtIONet *n, struct iovec *iov, int iovcnt,
     hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
 
 #ifdef TAP_VNET_HDR
-    if (tap_has_vnet_hdr(n->vc->vlan->first_client)) {
+    if ((host->has_vnet_hdr && host->has_vnet_hdr(n->vc->vlan->first_client))) {
         memcpy(hdr, buf, sizeof(*hdr));
         offset = sizeof(*hdr);
         work_around_broken_dhclient(hdr, buf + offset, size - offset);
@@ -425,7 +425,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
         return 1;
 
 #ifdef TAP_VNET_HDR
-    if (tap_has_vnet_hdr(n->vc->vlan->first_client))
+    if ((host->has_vnet_hdr &&
+         host->has_vnet_hdr(n->vc->vlan->first_client)))
         ptr += sizeof(struct virtio_net_hdr);
 #endif
@@ -529,7 +530,8 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
     VirtQueueElement elem;
 #ifdef TAP_VNET_HDR
-    int has_vnet_hdr = tap_has_vnet_hdr(n->vc->vlan->first_client);
+    int has_vnet_hdr = (host->has_vnet_hdr &&
+                        host->has_vnet_hdr(n->vc->vlan->first_client));
 #else
     int has_vnet_hdr = 0;
 #endif
@@ -620,7 +622,7 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
     qemu_put_buffer(f, (uint8_t *)n->vlans, MAX_VLAN >> 3);
 
 #ifdef TAP_VNET_HDR
-    qemu_put_be32(f, tap_has_vnet_hdr(n->vc->vlan->first_client));
+    qemu_put_be32(f, (host->has_vnet_hdr && host->has_vnet_hdr(n->vc->vlan->first_client)));
 #else
     qemu_put_be32(f, 0);
 #endif
diff --git a/net.c b/net.c
index 931def1..b56ae78 100644
--- a/net.c
+++ b/net.c
@@ -754,7 +754,7 @@ static void vmchannel_read(void *opaque, const uint8_t *buf, int size)
 
 #ifdef _WIN32
 
-int tap_has_vnet_hdr(void *opaque)
+static int tap_has_vnet_hdr(struct VLANClientState *vc)
 {
     return 0;
 }
@@ -906,9 +906,8 @@ static void tap_send(void *opaque)
     } while (s->size > 0);
 }
 
-int tap_has_vnet_hdr(void *opaque)
+static int tap_has_vnet_hdr(struct VLANClientState *vc)
 {
-    VLANClientState *vc = opaque;
     TAPState *s = vc->opaque;
 
     return s ? s->has_vnet_hdr : 0;
@@ -991,6 +990,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan,
     s->has_vnet_hdr = vnet_hdr != 0;
     s->vc = qemu_new_vlan_client(vlan, model, name, tap_receive,
                                  NULL, tap_cleanup, s);
+    s->vc->has_vnet_hdr = tap_has_vnet_hdr;
     s->vc->fd_readv = tap_receive_iov;
 #ifdef TUNSETOFFLOAD
     s->vc->set_offload = tap_set_offload;
diff --git a/net.h b/net.h
index bc42428..7c79734 100644
--- a/net.h
+++ b/net.h
@@ -21,6 +21,7 @@ struct VLANClientState {
     IOCanRWHandler *fd_can_read;
     NetCleanup *cleanup;
     LinkStatusChanged *link_status_changed;
+    int (*has_vnet_hdr)(struct VLANClientState *);
     int link_down;
     SetOffload *set_offload;
     void *opaque;
@@ -72,7 +73,6 @@ void qemu_handler_true(void *opaque);
 void do_info_network(Monitor *mon);
 int do_set_link(Monitor *mon, const char *name, const char *up_or_down);
 
-int tap_has_vnet_hdr(void *opaque);
 void tap_using_vnet_hdr(void *opaque, int using_vnet_hdr);
 
 /* NIC info */
Re: [PATCH 1/1] net: fix vnet_hdr bustage with slirp
On Friday 07 August 2009, Mark McLoughlin wrote:
> The vnet_hdr code in qemu-kvm.git is a hack which we plan to
> (eventually) replace by allowing a nic to be paired directly with a
> backend. Your patch is fine, but I'd suggest since both are a hack we
> stick with mine since it'll reduce merge conflicts. Both hacks will
> go away eventually, anyway.

Ok, sounds good.

Thanks,

	Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Monday 10 August 2009, Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to
> reduce the number of system calls involved in virtio networking.
> Existing virtio net code is used in the guest without modification.

Very nice, I loved reading it. It's getting rather late in my time
zone, so this comments only on the network driver. I'll go through the
rest tomorrow.

> @@ -293,6 +293,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>  		err = PTR_ERR(vblk->vq);
>  		goto out_free_vblk;
>  	}
> +	printk(KERN_ERR "vblk->vq = %p\n", vblk->vq);
>
>  	vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
>  	if (!vblk->pool) {
> @@ -383,6 +384,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>  	if (!err)
>  		blk_queue_logical_block_size(vblk->disk->queue, blk_size);
>
> +	printk(KERN_ERR "virtio_config_val returned %d\n", err);
> +
>  	add_disk(vblk->disk);
>  	return 0;

I guess you meant to remove these before submitting.

> +static void handle_tx_kick(struct work_struct *work);
> +static void handle_rx_kick(struct work_struct *work);
> +static void handle_tx_net(struct work_struct *work);
> +static void handle_rx_net(struct work_struct *work);

[style] I think the code gets more readable if you reorder it so that
you don't need forward declarations for static functions.

> +static long vhost_net_reset_owner(struct vhost_net *n)
> +{
> +	struct socket *sock = NULL;
> +	long r;
> +
> +	mutex_lock(&n->dev.mutex);
> +	r = vhost_dev_check_owner(&n->dev);
> +	if (r)
> +		goto done;
> +	sock = vhost_net_stop(n);
> +	r = vhost_dev_reset_owner(&n->dev);
> +done:
> +	mutex_unlock(&n->dev.mutex);
> +	if (sock)
> +		fput(sock->file);
> +	return r;
> +}

What is the difference between vhost_net_reset_owner(n) and
vhost_net_set_socket(n, -1)?

> +static struct file_operations vhost_net_fops = {
> +	.owner          = THIS_MODULE,
> +	.release        = vhost_net_release,
> +	.unlocked_ioctl = vhost_net_ioctl,
> +	.open           = vhost_net_open,
> +};

This is missing a compat_ioctl pointer. It should simply be

static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
				   unsigned long arg)
{
	return vhost_net_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
}

> +/* Bits from fs/aio.c. TODO: export and use from there? */
> +/*
> + * use_mm
> + *	Makes the calling kernel thread take on the specified
> + *	mm context.
> + *	Called by the retry thread execute retries within the
> + *	iocb issuer's mm context, so that copy_from/to_user
> + *	operations work seamlessly for aio.
> + *	(Note: this routine is intended to be called only
> + *	from a kernel thread context)
> + */
> +static void use_mm(struct mm_struct *mm)
> +{
> +	struct mm_struct *active_mm;
> +	struct task_struct *tsk = current;
> +
> +	task_lock(tsk);
> +	active_mm = tsk->active_mm;
> +	atomic_inc(&mm->mm_count);
> +	tsk->mm = mm;
> +	tsk->active_mm = mm;
> +	switch_mm(active_mm, mm, tsk);
> +	task_unlock(tsk);
> +
> +	mmdrop(active_mm);
> +}

Why do you need a kernel thread here? If the data transfer functions
all get called from a guest intercept, shouldn't you already be in the
right mm?

> +static void handle_tx(struct vhost_net *net)
> +{
> +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> +	unsigned head, out, in;
> +	struct msghdr msg = {
> +		.msg_name = NULL,
> +		.msg_namelen = 0,
> +		.msg_control = NULL,
> +		.msg_controllen = 0,
> +		.msg_iov = (struct iovec *)vq->iov + 1,
> +		.msg_flags = MSG_DONTWAIT,
> +	};
> +	size_t len;
> +	int err;
> +	struct socket *sock = rcu_dereference(net->sock);
> +	if (!sock || !sock_writeable(sock->sk))
> +		return;
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	for (;;) {
> +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
> +		if (head == vq->num)
> +			break;
> +		if (out <= 1 || in) {
> +			vq_err(vq, "Unexpected descriptor format for TX: "
> +			       "out %d, int %d\n", out, in);
> +			break;
> +		}
> +		/* Sanity check */
> +		if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
> +			vq_err(vq, "Unexpected header len for TX: "
> +			       "%ld expected %zd\n", vq->iov->iov_len,
> +			       sizeof(struct virtio_net_hdr));
> +			break;
> +		}
> +		/* Skip header. TODO: support TSO. */
> +		msg.msg_iovlen = out - 1;
> +		len = iov_length(vq->iov + 1, out - 1);
> +		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> +		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> +		if (err < 0) {
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Monday 10 August 2009 20:10:44 Michael S. Tsirkin wrote: On Mon, Aug 10, 2009 at 09:51:18PM +0200, Arnd Bergmann wrote: what is the difference between vhost_net_reset_owner(n) and vhost_net_set_socket(n, -1)? set socket to -1 will only stop the device. reset owner will let another process take over the device. It also needs to reset all parameters to make it safe for that other process, so in particular the device is stopped. ok I tried explaining this in the header vhost.h - does the comment there help, or do I need to clarify it? No, I just didn't get there yet.

I had the impression that if there's no compat_ioctl, unlocked_ioctl will get called automatically. No? It will issue a kernel warning but not call unlocked_ioctl, so you need either a compat_ioctl method or to list the numbers in fs/compat_ioctl.c, which I try to avoid.

Why do you need a kernel thread here? If the data transfer functions all get called from a guest intercept, shouldn't you already be in the right mm? several reasons :)

 - I get called under lock, so can't block
 - eventfd can be passed to another process, and I won't be in guest context at all
 - this also gets called outside guest context from socket poll
 - vcpu is blocked while it's doing i/o; it is better to free it up as all the packet copying might take a while

Ok. I guess that this is where one could plug into macvlan directly, using sock_alloc_send_skb/memcpy_fromiovec/dev_queue_xmit, instead of filling a msghdr for each, if we want to combine this with the work I did on that. quite possibly. Or one can just bind a raw socket to macvlan :) Right, that works as well, but may get more complicated once we try to add zero-copy or other optimizations.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2 0/2] vhost: a kernel-level virtio server
On Wednesday 12 August 2009, Gregory Haskins wrote: Are you saying SRIOV is a requirement, and I can either program the SRIOV adapter with a mac or use promisc? Or are you saying I can use SRIOV+programmed mac OR a regular nic + promisc (with a perf penalty). SRIOV is not a requirement. And you can also use a dedicated nic+programmed mac if you are so inclined. Makes sense. Got it.

I was going to add guest-to-guest to the test matrix, but I assume that is not supported with vhost unless you have something like a VEPA enabled bridge? If I understand it correctly, you can at least connect a veth pair to a bridge, right? Something like

               veth0 <-> veth1 <-> vhost <-> guest 1
 eth0 <-> br0 --|
               veth2 <-> veth3 <-> vhost <-> guest 2

It's a bit more complicated than it needs to be, but should work fine.

Arnd
Re: [PATCHv2 0/2] vhost: a kernel-level virtio server
On Wednesday 12 August 2009, Michael S. Tsirkin wrote: If I understand it correctly, you can at least connect a veth pair to a bridge, right? Something like

               veth0 <-> veth1 <-> vhost <-> guest 1
 eth0 <-> br0 --|
               veth2 <-> veth3 <-> vhost <-> guest 2

Heh, you don't need a bridge in this picture:

 guest 1 <-> vhost <-> veth0 <-> veth1 <-> vhost <-> guest 2

Sure, but the setup I described is the one that I would expect to see in practice because it gives you external connectivity. Measuring two guests communicating over a veth pair is interesting for finding the bottlenecks, but of little practical relevance.

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Monday 10 August 2009, Michael S. Tsirkin wrote:

+struct workqueue_struct *vhost_workqueue;

[nitpicking] This could be static.

+/* The virtqueue structure describes a queue attached to a device. */
+struct vhost_virtqueue {
+	struct vhost_dev *dev;
+
+	/* The actual ring of buffers. */
+	struct mutex mutex;
+	unsigned int num;
+	struct vring_desc __user *desc;
+	struct vring_avail __user *avail;
+	struct vring_used __user *used;
+	struct file *kick;
+	struct file *call;
+	struct file *error;
+	struct eventfd_ctx *call_ctx;
+	struct eventfd_ctx *error_ctx;
+
+	struct vhost_poll poll;
+
+	/* The routine to call when the Guest pings us, or timeout. */
+	work_func_t handle_kick;
+
+	/* Last available index we saw. */
+	u16 last_avail_idx;
+
+	/* Last index we used. */
+	u16 last_used_idx;
+
+	/* Outstanding buffers */
+	unsigned int inflight;
+
+	/* Is this blocked? */
+	bool blocked;
+
+	struct iovec iov[VHOST_NET_MAX_SG];
+
+} ____cacheline_aligned;

We discussed this before, and I still think this could be directly derived from struct virtqueue, in the same way that vring_virtqueue is derived from struct virtqueue. That would make it possible for simple device drivers to use the same driver in both host and guest, similar to how Ira Snyder used virtqueues to make virtio_net run between two hosts running the same code [1]. Ideally, I guess you should be able to even make virtio_net work in the host if you do that, but that could bring other complexities.

Arnd

[1] http://lkml.org/lkml/2009/2/23/353
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wednesday 12 August 2009, Michael S. Tsirkin wrote: On Wed, Aug 12, 2009 at 07:03:22PM +0200, Arnd Bergmann wrote: We discussed this before, and I still think this could be directly derived from struct virtqueue, in the same way that vring_virtqueue is derived from struct virtqueue. I prefer keeping it simple. Much of the abstraction in virtio is due to the fact that it needs to work on top of different hardware emulations: lguest, kvm, possibly others in the future. vhost is always working on real hardware, using eventfd as the interface, so it does not need that.

Well, that was my point: virtio can already work on a number of abstractions, so adding one more for vhost should not be too hard.

That would make it possible for simple device drivers to use the same driver in both host and guest. I don't think so. For example, there's a callback field that gets invoked in the guest when buffers are consumed. It could be overloaded to mean "buffers are available" in the host, but you never handle both situations in the same way, so what's the point? ... As I pointed out earlier, most code in virtio net is asymmetrical: the guest provides buffers, the host consumes them. Possibly, one could use virtio rings in a symmetrical way, but support of existing guest virtio net means there's almost no shared code.

The trick is to swap the virtqueues instead. virtio-net is actually mostly symmetric in just the same way that the physical wires on a twisted pair ethernet are symmetric (I like how that analogy fits). virtio_net kicks the transmit virtqueue when it has data and it kicks the receive queue when it has empty buffers to fill, and it has callbacks when the two are done. You can do the same in both the guest and the host, but then the guest's input virtqueue is the host's output virtqueue and vice versa.

Once a virtqueue got kicked from both sides, the vhost_virtqueue implementation between the two only needs to do a copy_from_user or copy_to_user (possibly from a thread if it is in atomic context) and then call the two callback functions. This is basically the same thing you do already, except that you use slightly different names for the components.

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wednesday 12 August 2009, Anthony Liguori wrote: At any rate, I'd like to see performance results before we consider trying to reuse virtio code. Yes, I agree. I'd also like to do more work on the macvlan extensions to see if it works out without involving a socket. Passing the socket into the vhost_net device is a nice feature of the current implementation that we'd have to give up for something else (e.g. making the vhost a real network interface that you can hook up to a bridge) if it were to use virtio. Unless I can come up with a solution that is clearly superior, I'm taking back my objections on that part for now. Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Arnd Bergmann wrote: Unfortunately, this also implies that you could no longer simply use the packet socket interface as you do currently, as I realized only now. This obviously has a significant impact on your user space interface. Also, if we do the copy in the transport, it definitely means that we can't get to zero-copy RX/TX from guest space any more. The current vhost_net driver doesn't do that yet, but could be extended in the same way that I'm hoping to do it for macvtap. Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote: On Wed, Aug 12, 2009 at 07:59:47PM +0200, Arnd Bergmann wrote: The trick is to swap the virtqueues instead. virtio-net is actually mostly symmetric in just the same way that the physical wires on a twisted pair ethernet are symmetric (I like how that analogy fits). You need to really squint hard for it to look symmetric. For example, for RX, virtio allocates an skb, puts a descriptor on a ring and waits for the host to fill it in. The host system can not do the same: the guest does not have access to host memory. You can do a copy in the transport to hide this fact, but it will kill performance.

Yes, that is what I was suggesting all along. The actual copy operation has to be done by the host transport, which is obviously different from the guest transport that just calls the host using vring_kick(). Right now, the number of copy operations in your code is the same. You are doing the copy a little bit later in skb_copy_datagram_iovec(), which is indeed a very nice hack. Changing to a virtqueue based method would imply that the host needs to add each skb_frag_t to its outbound virtqueue, which then gets copied into the guest's inbound virtqueue. Unfortunately, this also implies that you could no longer simply use the packet socket interface as you do currently, as I realized only now. This obviously has a significant impact on your user space interface.

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote: The best way to do this IMO would be to add zero copy support to raw sockets, vhost will then get it basically for free.

Yes, that would be nice. I wonder if that could lead to security problems on TX though. I guess it will only bring significant performance improvements if we leave the data writable in user space or guest memory during the operation. If the user finds the right timing, it could modify the frame headers after they have been checked using netfilter, or while the frames are being consumed in the kernel (e.g. an NFS server running in a guest).

Arnd
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote: Right now, the number of copy operations in your code is the same. You are doing the copy a little bit later in skb_copy_datagram_iovec(), which is indeed a very nice hack. Changing to a virtqueue based method would imply that the host needs to add each skb_frag_t to its outbound virtqueue, which then gets copied into the guest's inbound virtqueue. Which is a lot more code than just calling skb_copy_datagram_iovec.

Well, I don't see this part as much of a problem, because the code already exists in virtio_net. If we really wanted to go down that road, just using virtio_net would solve the problem of frame handling entirely, but create new problems elsewhere, as we have mentioned.

Arnd
Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Thursday 13 August 2009, Michael S. Tsirkin wrote: What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification.

AFAICT, you have addressed all my comments, mostly by convincing me that you got it right anyway ;-). I hope this gets into 2.6.32, good work!

Signed-off-by: Michael S. Tsirkin m...@redhat.com

Acked-by: Arnd Bergmann a...@arndb.de

One idea though:

+	/* Parameter checking */
+	if (sock->sk->sk_type != SOCK_RAW) {
+		r = -ESOCKTNOSUPPORT;
+		goto done;
+	}
+
+	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
+			       &uaddr_len, 0);
+	if (r)
+		goto done;
+
+	if (uaddr.sa.sll_family != AF_PACKET) {
+		r = -EPFNOSUPPORT;
+		goto done;
+	}

You currently limit the scope of the driver by only allowing raw packet sockets to be passed into the network driver. In qemu, we currently support some very similar transports:

 * raw packet (not in a release yet)
 * tcp connection
 * UDP multicast
 * tap character device
 * VDE with Unix local sockets

My primary interest right now is the tap support, but I think it would be interesting in general to allow different file descriptor types in vhost_net_set_socket. AFAICT, there are two major differences that we need to handle for this:

 * Most of the transports are sockets, tap uses a character device. This could be dealt with by having both a struct socket * in struct vhost_net *and* a struct file *, or by always keeping the struct file and calling vfs_readv/vfs_writev for the data transport in both cases.

 * Each transport has a slightly different header; we have

   - raw ethernet frames (raw, udp multicast, tap)
   - 32-bit length + raw frames, possibly fragmented (tcp)
   - 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)

   To handle these three cases, we need either different ioctl numbers so that vhost_net can choose the right one, or a flags field in VHOST_NET_SET_SOCKET, like

	#define VHOST_NET_RAW		1
	#define VHOST_NET_LEN_HDR	2
	#define VHOST_NET_VNET_HDR	4

	struct vhost_net_socket {
		unsigned int flags;
		int fd;
	};

	#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)

If both of those are addressed, we can treat vhost_net as a generic way to do network handling in the kernel independent of the qemu model (raw, tap, ...) for it. Your qemu patch would have to work differently, so instead of

	qemu -net nic,vhost=eth0

you would do the same as today with the raw packet socket extension

	qemu -net nic -net raw,ifname=eth0

Qemu could then automatically try to use vhost_net, if it's available in the kernel, or just fall back on software vlan otherwise. Does that make sense?

Arnd
Re: [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Tuesday 18 August 2009, Gregory Haskins wrote: Avi Kivity wrote: On 08/17/2009 10:33 PM, Gregory Haskins wrote: One point of contention is that this is all managementy stuff and should be kept out of the host kernel. Exposing shared memory, interrupts, and guest hypercalls can all be easily done from userspace (as virtio demonstrates). True, some devices need kernel acceleration, but that's no reason to put everything into the host kernel.

See my last reply to Anthony. My two points here are that:

 a) having it in-kernel makes it a complete subsystem, which perhaps has diminished value in kvm, but adds value in most other places that we are looking to use vbus.

 b) the in-kernel code is being overstated as complex. We are not talking about your typical virt thing, like an emulated ICH/PCI chipset. It's really a simple list of devices with a handful of attributes. They are managed using established linux interfaces, like sysfs/configfs.

IMHO the complexity of the code is not so much of a problem. What I see as a problem is the complexity of a kernel/user space interface that manages the devices with global state. One of the greatest features of Michael's vhost driver is that all the state is associated with open file descriptors that either exist already or belong to the vhost_net misc device. When a process dies, all the file descriptors get closed and the whole state is cleaned up implicitly. AFAICT, you can't do that with the vbus host model.

What performance oriented items have been left unaddressed? Well, the interrupt model to name one. The performance aspects of your interrupt model are independent of the vbus proxy, or at least they should be. Let's assume for now that your event notification mechanism gives significant performance improvements (which we can't measure independently right now). I don't see a reason why we could not get the same performance out of a paravirtual interrupt controller that uses the same method, and it would be straightforward to implement one and use that together with all the existing emulated PCI devices and virtio devices including vhost_net.

Arnd
Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Tuesday 18 August 2009 20:35:22 Michael S. Tsirkin wrote: On Tue, Aug 18, 2009 at 10:27:52AM -0700, Ira W. Snyder wrote: Also, in my case I'd like to boot Linux with my rootfs over NFS. Is vhost-net capable of this? I've had Arnd, BenH, and Grant Likely (and others, privately) contact me about devices they are working with that would benefit from something like virtio-over-PCI. I'd like to see vhost-net be merged with the capability to support my use case. There are plenty of others that would benefit, not just myself. yes. I'm not sure vhost-net is being written with this kind of future use in mind. I'd hate to see it get merged, and then have to change the ABI to support physical-device-to-device usage. It would be better to keep future use in mind now, rather than try to hack it in later. I still need to think your usage over. I am not so sure this fits what vhost is trying to do. If not, possibly it's better to just have a separate driver for your device.

I now think we need both. virtio-over-PCI does it the right way for its purpose and can be rather generic. It could certainly be extended to support virtio-net on both sides (host and guest) of KVM, but I think it better fits the use where a kernel wants to communicate with some other machine where you normally wouldn't think of using qemu. Vhost-net OTOH is great in the way that it serves as an easy way to move the virtio-net code from qemu into the kernel, without changing its behaviour. It should even be straightforward to do live-migration between hosts with and without it, something that would be much harder with the virtio-over-PCI logic. Also, its internal state is local to the process owning its file descriptor, which makes it much easier to manage permissions and cleanup of its resources.

Arnd