Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-21 Thread Gleb Natapov
On Thu, Aug 20, 2009 at 04:38:17PM +0300, Michael S. Tsirkin wrote:
 On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
  On Thursday 20 August 2009, Michael S. Tsirkin wrote:
   On Wed, Aug 19, 2009 at 05:27:07PM +0200, Arnd Bergmann wrote:
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
 On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
  On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
 
  Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
  VHOST_SET_OWNER, VHOST_RESET_OWNER
 
 SET/RESET OWNER is still needed: otherwise if you share a descriptor
 with another process, it can corrupt your memory.

How? The point of using user threads is that you only ever access the
address space of the thread that called the ioctl.
   
   Think about this example with processes A and B sharing an fd:
   A does SET_USED_ADDRESS
   B does SET_USED_ADDRESS
   A does VHOST_NET_SPLICE
   See how stuff gets written into a random place in memory of A?
  
  Yes, I didn't think of that. It doesn't seem like a big problem
  though, because it's a clear misuse of the API (I guess your
  current code returns an error for one of the SET_USED_ADDRESS
  ioctls), so I would see it as a classic garbage-in garbage-out
  case.
  
  It may even work in the case that the sharing of the fd resulted
  from a fork, where the address contains the same buffer in both
  processes. I can't think of a reason why you would want to use
  it like that though.
 
 It doesn't matter that I don't want this: allowing one process to corrupt
 another's memory is a security issue.  Once you get an fd, you want to
 be able to use it without worrying that a bug in another process will
 crash yours.
 
If B's SET_USED_ADDRESS fails, how can one process corrupt the memory of
another process?

--
Gleb.


Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-20 Thread Arnd Bergmann
On Thursday 20 August 2009, Michael S. Tsirkin wrote:
 On Wed, Aug 19, 2009 at 05:27:07PM +0200, Arnd Bergmann wrote:
  On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
   On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
   
Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
VHOST_SET_OWNER, VHOST_RESET_OWNER
   
   SET/RESET OWNER is still needed: otherwise if you share a descriptor
   with another process, it can corrupt your memory.
  
  How? The point of using user threads is that you only ever access the
  address space of the thread that called the ioctl.
 
 Think about this example with processes A and B sharing an fd:
 A does SET_USED_ADDRESS
 B does SET_USED_ADDRESS
 A does VHOST_NET_SPLICE
 See how stuff gets written into a random place in memory of A?

Yes, I didn't think of that. It doesn't seem like a big problem
though, because it's a clear misuse of the API (I guess your
current code returns an error for one of the SET_USED_ADDRESS
ioctls), so I would see it as a classic garbage-in garbage-out
case.

It may even work in the case that the sharing of the fd resulted
from a fork, where the address contains the same buffer in both
processes. I can't think of a reason why you would want to use
it like that though.

  Why would I wake up the threads spuriously? Do you mean for
  stopping the transmission or something else? I guess a pthread_kill
  would be enough for shutting it down.
 
 If you kill and restart them you lose the priority and other parameters, but maybe.

If you want to restart it, just send a nonfatal signal (SIGUSR1,
SIGRTMIN, ...) instead of a SIGKILL.
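
As an illustration, a minimal sketch of that signal-based restart (assuming the
thread is blocked in a long-running ioctl or read; nothing here is existing
vhost code):

  #include <signal.h>

  static void noop(int sig) { (void)sig; }

  /* Arm SIGUSR1 in the splice thread *without* SA_RESTART, so a blocked
   * syscall returns -EINTR instead of being silently restarted. */
  static void arm_wakeup_signal(void)
  {
          struct sigaction sa = { .sa_handler = noop };
          sigemptyset(&sa.sa_mask);
          sigaction(SIGUSR1, &sa, NULL);
  }

  /* Elsewhere: pthread_kill(splice_thread, SIGUSR1) kicks the thread out
   * of the kernel without losing its priority or CPU affinity, and the
   * thread decides whether to re-enter or exit. */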

   More importantly, you lose control of CPU locality.  Simply put, a
   natural threading model in virtualization is one thread per guest vcpu.
   Asking applications to add multiple helper threads just so they can
   block forever is wrong, IMO, as userspace has no idea which CPU
   they should be on, what priority to use, etc.
  
  But the kernel also doesn't know this; you get the same problem in
  another way. If you have multiple guests running at different priorities,
  the kernel will use those priorities to do the more important transfers
  first, while with a global workqueue every guest gets the same priority.
 
 We could create more threads if this becomes a problem. I just think it
 should be transparent to userspace. Possibly it's useful to look at the
 packet header as well to decide on priority: this is something userspace
 can't do.

Being transparent to user space would be nice, I agree. Letting user space
choose would also be nice, e.g. if you want to distribute eight available
hardware queue pairs to three guests in a non-obvious way. The
implementation depends to some degree on how we want to do multiqueue
RX/TX in virtio-net in the long run. For best cache locality and NUMA
behaviour, we might want to have one virtqueue per guest CPU and control
them independently from the host.

Priorities of the packets are dealt with in the packet scheduler for
external interfaces, I guess that is sufficient. I'm not sure if we
need to honor the same priorities for guest-to-guest communication,
my feeling is that we don't need to.

  You say that the natural model is to have one thread per guest
  CPU,
 
 Sorry I was not clear. I think userspace should create a thread per guest.
 We can create as many as we need for networking but I think this should
 be behind the scenes, so userspace shouldn't bother with host CPUs, it
 will just get it wrong. Think of CPU hotplug, interrupts migrating
 between CPUs, etc ...

Yes, I hope we can avoid letting qemu know about host CPUs.
I'm not sure we can avoid it completely, because something needs
to set physical IRQ affinity and such for the virtual devices
if you want to get the best locality.

  but you have a thread per host CPU instead. If the numbers
  are different, you probably lose either way.
 
 The trick I used is to keep as much as possible local:
 TX is done on the CPU that runs the guest,
 RX is done on the CPU that runs the NIC interrupt.
 A smart SMP guest sees which CPU gets interrupts
 from the NIC and schedules RX there, and it shouldn't matter
 if the numbers of CPUs are different.

Yes, that sounds good.

  It gets worse if you try to apply NUMA policies.
 
 I believe the best we could do is avoid switching CPUs
 until we know the actual destination.

My point is that the RX data in the guest address space
should be on the same NUMA node that gets the interrupt.

   And note that there might be more than one error.  I guess that's
   another problem with trying to layer on top of vfs.
  
  Why is that different from any other system call?
 
 With other system calls nothing happens while you process the error.
 Here, the guest (other queues) and the network keep running (unless
 there is a thread per queue, maybe we can block a queue, but we both
 agreed above we don't want that).


Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-20 Thread Arnd Bergmann
On Thursday 20 August 2009, Michael S. Tsirkin wrote:
 On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
  On Thursday 20 August 2009, Michael S. Tsirkin wrote:

 It doesn't matter that I don't want this: allowing one process to corrupt
 another's memory is a security issue.  Once you get an fd, you want to
 be able to use it without worrying that a bug in another process will
 crash yours.

Ok, got it. Yes, that would be unacceptable.
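
For illustration, the kind of check SET_OWNER enables could look roughly like
this (a simplified sketch, not the actual patch code; names are illustrative):

  #include <linux/sched.h>
  #include <linux/mm.h>

  struct vhost_dev {
          struct mm_struct *mm;   /* owner address space, NULL until SET_OWNER */
          /* ... rings, log, ... */
  };

  static long vhost_dev_set_owner(struct vhost_dev *dev)
  {
          if (dev->mm)
                  return -EBUSY;              /* already owned */
          dev->mm = get_task_mm(current);
          return 0;
  }

  static int vhost_dev_check_owner(struct vhost_dev *dev)
  {
          /* B's SET_USED_ADDRESS (and everything else touching guest
           * memory) fails here, so it can never make the kernel write
           * into A's address space. */
          return dev->mm == current->mm ? 0 : -EPERM;
  }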

   If you assume losing the code for the second error condition is OK, why
   is the first one so important?  That's why I used a counter (eventfd)
   per virtqueue: on error, userspace can scan the ring and poll the socket
   to discover what's wrong, and the counter ensures we can detect that an
   error happened while we were not looking.
  
  I guess we were talking about different kinds of errors here, and I'm
  still not sure which one you are talking about.
  
 Non-fatal errors. E.g. translation errors probably should be
 non-fatal. I can also imagine working around guest bugs in
 userspace.

Ah, so I guess the confusion was that I was worried about
errors coming from the socket file descriptor, while you
were thinking of errors from the guest side, which I did not
expect to happen.

The errors from the socket (or chardev, as that was the
start of the argument) should still fit into the categories
that I mentioned, either they can be handled by the host
kernel, or they are fatal.

I'll read up in your code to see how you handle asynchronous
non-fatal errors from the guest. Intuitively, I'd still
assume that returning the first error should be enough
because it will typically mean that you cannot continue
without fixing it up first, and you might get the next
error immediately after that.

Arnd 


Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-20 Thread Michael S. Tsirkin
On Thu, Aug 20, 2009 at 04:31:36PM +0200, Arnd Bergmann wrote:
 On Thursday 20 August 2009, Michael S. Tsirkin wrote:
  On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
   On Thursday 20 August 2009, Michael S. Tsirkin wrote:
 
  It doesn't matter that I don't want this: allowing one process to corrupt
  another's memory is a security issue.  Once you get an fd, you want to
  be able to use it without worrying that a bug in another process will
  crash yours.
 
 Ok, got it. Yes, that would be unacceptable.
 
If you assume losing the code for the second error condition is OK, why
is the first one so important?  That's why I used a counter (eventfd)
per virtqueue: on error, userspace can scan the ring and poll the socket
to discover what's wrong, and the counter ensures we can detect that an
error happened while we were not looking.
   
   I guess we were talking about different kinds of errors here, and I'm
   still not sure which one you are talking about.
   
  Non-fatal errors. E.g. translation errors probably should be
  non-fatal. I can also imagine working around guest bugs in
  userspace.
 
 Ah, so I guess the confusion was that I was worried about
 errors coming from the socket file descriptor, while you
 were thinking of errors from the guest side, which I did not
 expect to happen.
 
 The errors from the socket (or chardev, as that was the
 start of the argument) should still fit into the categories
 that I mentioned, either they can be handled by the host
 kernel, or they are fatal.

Hmm, are you sure? Imagine a device going away while the socket is bound to
it.  You get -ENXIO. It's not fatal in the sense that you can bind the
socket to another device and go on, right?
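
For concreteness, a userspace sketch of that recovery for a packet socket
("eth1" is just a placeholder interface name):

  #include <arpa/inet.h>
  #include <net/if.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <linux/if_ether.h>
  #include <linux/if_packet.h>

  /* When sends start failing with ENXIO because the bound NIC went away,
   * re-bind the same AF_PACKET socket to another interface and carry on. */
  static int rebind_packet_socket(int fd, const char *new_ifname)
  {
          struct sockaddr_ll sll;

          memset(&sll, 0, sizeof(sll));
          sll.sll_family   = AF_PACKET;
          sll.sll_protocol = htons(ETH_P_ALL);
          sll.sll_ifindex  = if_nametoindex(new_ifname);
          if (!sll.sll_ifindex)
                  return -1;              /* no such interface */

          /* bind() on a packet socket just switches the bound device */
          return bind(fd, (struct sockaddr *)&sll, sizeof(sll));
  }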

 I'll read up in your code to see how you handle asynchronous
 non-fatal errors from the guest. Intuitively, I'd still
 assume that returning the first error should be enough
 because it will typically mean that you cannot continue
 without fixing it up first, and you might get the next
 error immediately after that.
 
   Arnd 

Yes, but we need to be careful not to lose that next error, and these
errors only block one queue. I handle this by reporting them on a
per-vq eventfd, which userspace can poll.
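
A minimal sketch of what that looks like from userspace (how the error eventfd
is obtained from vhost is left out; the function name is illustrative):

  #include <poll.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  static void wait_for_vq_error(int err_efd)
  {
          struct pollfd pfd = { .fd = err_efd, .events = POLLIN };
          uint64_t count;

          if (poll(&pfd, 1, -1) <= 0)
                  return;

          /* read() returns and resets the 64-bit counter, so errors that
           * happened while we were not looking are still accounted for. */
          if (read(err_efd, &count, sizeof(count)) == sizeof(count))
                  fprintf(stderr, "vq reported %llu error events\n",
                          (unsigned long long)count);
          /* ... then scan the ring and poll the socket to find out why ... */
  }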

-- 
MST


Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-20 Thread Arnd Bergmann
On Thursday 20 August 2009, Michael S. Tsirkin wrote:
 
  The errors from the socket (or chardev, as that was the
  start of the argument) should still fit into the categories
  that I mentioned, either they can be handled by the host
  kernel, or they are fatal.
 
 Hmm, are you sure? Imagine a device going away while the socket is bound to
 it.  You get -ENXIO. It's not fatal in the sense that you can bind the
 socket to another device and go on, right?

Right. Not fatal in that sense, but fatal in the sense that I
can no longer transmit other frames until you recover. I think
we both meant the same here.

Arnd 


Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-19 Thread Arnd Bergmann
On Sunday 16 August 2009, Michael S. Tsirkin wrote:
 On Fri, Aug 14, 2009 at 01:40:36PM +0200, Arnd Bergmann wrote:
  
  * most of the transports are sockets, tap uses a character device.
This could be dealt with by having both a struct socket * in
struct vhost_net *and* a struct file *, or by always keeping the
struct file and calling vfs_readv/vfs_writev for the data transport
in both cases.
 
 I am concerned that character devices might have weird side effects with
 read/write operations and that calling them from a kernel thread the way I
 do might have security implications. Can't point at anything specific
 though at the moment.

I understand your feelings about passing a chardev fd into your driver
and I agree that we need to be very careful if we want to allow it.

Maybe we could instead extend the 'splice' system call to work on a
vhost_net file descriptor. If we do that, we can put the access back
into a user thread (or two) that stays in splice indefinitely to
avoid some of the implications of kernel threads like the missing
ability to handle transfer errors in user space.

 I wonder - can we expose the underlying socket used by tap, or will that
 create complex lifetime issues?

I think this could get more messy in the long run than calling vfs_readv
on a random fd. It would mean deep internal knowledge of the tap driver
in vhost_net, which I really would prefer to avoid.

  * Each transport has a slightly different header, we have
- raw ethernet frames (raw, udp multicast, tap)
- 32-bit length + raw frames, possibly fragmented (tcp)
- 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)
To handle these three cases, we need either different ioctl numbers
so that vhost_net can choose the right one, or a flags field in
VHOST_NET_SET_SOCKET, like
  
#define VHOST_NET_RAW 1
#define VHOST_NET_LEN_HDR 2
#define VHOST_NET_VNET_HDR 4
  
struct vhost_net_socket {
  unsigned int flags;
  int fd;
};
#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)
 
 It seems we can query the socket to find out the type, 

yes, I understand that you can do that, but I still think that decision
should be left to user space. Adding a length header for TCP streams but
not for UDP is something that we would normally want to do, but IMHO
vhost_net should not need to know about this.
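
For reference, the "32-bit length + raw frames" framing above is just a prefix
on each frame so the receiver can re-split the byte stream; a sketch (network
byte order for the prefix is an assumption here):

  #include <arpa/inet.h>
  #include <stdint.h>
  #include <sys/uio.h>

  static ssize_t send_framed(int tcp_fd, const void *frame, uint32_t frame_len)
  {
          uint32_t hdr = htonl(frame_len);
          struct iovec iov[2] = {
                  { .iov_base = &hdr,          .iov_len = sizeof(hdr) },
                  { .iov_base = (void *)frame, .iov_len = frame_len   },
          };
          return writev(tcp_fd, iov, 2);  /* may be short; loop in real code */
  }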

 or use the features ioctl.

Right, I had forgotten about that one. It's probably equivalent
to the flags I suggested, except that one allows you to set features
after starting the communication, while the other one prevents
you from doing that.

  Qemu could then automatically try to use vhost_net, if it's available
  in the kernel, or just fall back on software vlan otherwise.
  Does that make sense?
 
 I agree, long term it should be enabled automatically when possible.

So how about making the qemu command line interface an extension to
what Or Gerlitz has done for the raw packet sockets?

Arnd 


Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-19 Thread Michael S. Tsirkin
On Wed, Aug 19, 2009 at 11:04:50AM +0200, Arnd Bergmann wrote:
 On Sunday 16 August 2009, Michael S. Tsirkin wrote:
  On Fri, Aug 14, 2009 at 01:40:36PM +0200, Arnd Bergmann wrote:
   
   * most of the transports are sockets, tap uses a character device.
 This could be dealt with by having both a struct socket * in
 struct vhost_net *and* a struct file *, or by always keeping the
 struct file and calling vfs_readv/vfs_writev for the data transport
 in both cases.
  
  I am concerned that character devices might have weird side effects with
  read/write operations and that calling them from a kernel thread the way I
  do might have security implications. Can't point at anything specific
  though at the moment.
 
 I understand your feelings about passing a chardev fd into your driver
 and I agree that we need to be very careful if we want to allow it.
 
 Maybe we could instead extend the 'splice' system call to work on a
 vhost_net file descriptor.  If we do that, we can put the access back
 into a user thread (or two) that stays in splice indefinitely

An issue with exposing the internal threading model to userspace
in this way is that we lose control of e.g. CPU locality -
and it is very hard for userspace to get it right.

 to
 avoid some of the implications of kernel threads like the missing
 ability to handle transfer errors in user space.

Are you talking about TCP here?
Transfer errors are typically asynchronous - possibly the eventfd
I expose for vhost net is sufficient there.

  I wonder - can we expose the underlying socket used by tap, or will that
  create complex lifetime issues?
 
 I think this could get more messy in the long run than calling vfs_readv
 on a random fd. It would mean deep internal knowledge of the tap driver
 in vhost_net, which I really would prefer to avoid.

No, what I had in mind is adding a GET_SOCKET ioctl to tap.
vhost would then just use the socket.
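
For illustration, one way to do it is an in-kernel helper on the tap side
rather than an ioctl; nothing like this exists yet and the names are
illustrative:

  #include <linux/err.h>
  #include <linux/file.h>
  #include <linux/net.h>

  /* in drivers/net/tun.c: hand out the socket tap already keeps internally,
   * returning ERR_PTR(-EINVAL) for non-tap descriptors */
  struct socket *tun_get_socket(struct file *file);

  /* in vhost_net: resolve the fd userspace passed in to that socket, then
   * use sendmsg/recvmsg on it, same code path as for packet sockets */
  static struct socket *get_tap_socket(int fd)
  {
          struct file *file = fget(fd);
          struct socket *sock;

          if (!file)
                  return ERR_PTR(-EBADF);
          sock = tun_get_socket(file);
          if (IS_ERR(sock))
                  fput(file);     /* not a tap device after all */
          return sock;
  }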

   * Each transport has a slightly different header, we have
 - raw ethernet frames (raw, udp multicast, tap)
 - 32-bit length + raw frames, possibly fragmented (tcp)
 - 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)
 To handle these three cases, we need either different ioctl numbers
 so that vhost_net can choose the right one, or a flags field in
 VHOST_NET_SET_SOCKET, like
   
 #define VHOST_NET_RAW   1
 #define VHOST_NET_LEN_HDR   2
 #define VHOST_NET_VNET_HDR  4
   
 struct vhost_net_socket {
 unsigned int flags;
 int fd;
 };
 #define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)
  
  It seems we can query the socket to find out the type, 
 
 yes, I understand that you can do that, but I still think that decision
 should be left to user space. Adding a length header for TCP streams but
 not for UDP is something that we would normally want to do, but IMHO
 vhost_net should not need to know about this.
 
  or use the features ioctl.
 
 Right, I had forgotten about that one. It's probably equivalent
 to the flags I suggested, except that one allows you to set features
 after starting the communication, while the other one prevents
 you from doing that.
 
   Qemu could then automatically try to use vhost_net, if it's available
   in the kernel, or just fall back on software vlan otherwise.
   Does that make sense?
  
  I agree, long term it should be enabled automatically when possible.
 
 So how about making the qemu command line interface an extension to
 what Or Gerlitz has done for the raw packet sockets?
 
   Arnd 

Not sure I see the connection, but I have not thought about qemu
side of things too much yet - trying to get kernel bits in place
first so that there's a stable ABI to work with.

-- 
MST


Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-19 Thread Arnd Bergmann
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
  Maybe we could instead extend the 'splice' system call to work on a
  vhost_net file descriptor.  If we do that, we can put the access back
  into a user thread (or two) that stays in splice indefinitely
 
 An issue with exposing the internal threading model to userspace
 in this way is that we lose control of e.g. CPU locality -
 and it is very hard for userspace to get it right.

Good point, I hadn't thought about that in this context.

For macvtap, my idea was to open the same tap char device multiple
times and use each fd exclusively on one *guest* CPU. I'm not sure
if virtio-net can already handle SMP guests efficiently. We might
actually need to extend it to have more pairs of virtqueues, one
for each guest CPU, which can then be bound to a host queue (or queue
pair) in the physical nic.

Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
VHOST_SET_OWNER, VHOST_RESET_OWNER and your kernel thread with a new
VHOST_NET_SPLICE blocking ioctl that does all the transfers in the
context of the calling thread.

This would improve the driver on various fronts:

- no need for playing tricks with use_mm/unuse_mm
- possibly fewer global TLB flushes from switch_mm, which
  may improve performance.
- ability to pass down error codes from socket or guest to
  user space by returning from ioctl
- based on that, the ability to use any kind of file
  descriptor that can do writev/readv or sendmsg/recvmsg
  without the nastiness you mentioned.

The disadvantage of course is that you need to add a user
thread for each guest device to make up for the workqueue
that you save.
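
To make that concrete, the per-device user thread could look like this
(VHOST_NET_SPLICE and its number are hypothetical, as is the whole interface):

  #include <errno.h>
  #include <stdio.h>
  #include <sys/ioctl.h>

  #define VHOST_NET_SPLICE _IO(0xAF, 0x40)  /* hypothetical; 0xAF is the vhost magic */

  static void *vhost_splice_thread(void *arg)
  {
          int vhost_fd = *(int *)arg;

          for (;;) {
                  if (ioctl(vhost_fd, VHOST_NET_SPLICE) == 0)
                          break;          /* told to stop, exit cleanly */
                  if (errno == EINTR)
                          continue;       /* signal: re-check state, re-enter */
                  /* EFAULT, EBADFD, EIO, ...: the error pops out right here,
                   * where userspace can decide what to do about it */
                  fprintf(stderr, "vhost splice error: %d\n", errno);
                  break;
          }
          return NULL;
  }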

  to
  avoid some of the implications of kernel threads like the missing
  ability to handle transfer errors in user space.
 
 Are you talking about TCP here?
 Transfer errors are typically asynchronous - possibly the eventfd
 I expose for vhost net is sufficient there.

I mean errors in general if we allow random file descriptors to be used.
E.g. tun_chr_aio_read could return EBADFD, EINVAL, EFAULT, ERESTARTSYS,
EIO, EAGAIN and possibly others. We can handle some in kernel, others
should never happen with vhost_net, but if something unexpected happens
it would be nice to just bail out to user space.

   I wonder - can we expose the underlying socket used by tap, or will that
   create complex lifetime issues?
  
  I think this could get more messy in the long run than calling vfs_readv
  on a random fd. It would mean deep internal knowledge of the tap driver
  in vhost_net, which I really would prefer to avoid.
 
 No, what I had in mind is adding a GET_SOCKET ioctl to tap.
 vhost would then just use the socket.

Right, that would work with tun/tap at least. It sounds a bit fishy
but I can't see a reason why it would be hard to do.
I'd have to think about how to get it working with macvtap, or if
there is much value left in macvtap after that anyway.

  So how about making the qemu command line interface an extension to
  what Or Gerlitz has done for the raw packet sockets?

 Not sure I see the connection, but I have not thought about qemu
 side of things too much yet - trying to get kernel bits in place
 first so that there's a stable ABI to work with.

Ok, fair enough. The kernel bits are obviously more time critical
right now, since they should get into 2.6.32.

Arnd 


Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-19 Thread Arnd Bergmann
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
 On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
  On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
 
  Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
  VHOST_SET_OWNER, VHOST_RESET_OWNER
 
 SET/RESET OWNER is still needed: otherwise if you share a descriptor
 with another process, it can corrupt your memory.

How? The point of using user threads is that you only ever access the
address space of the thread that called the ioctl.

  and your kernel thread with a new
  VHOST_NET_SPLICE blocking ioctl that does all the transfers in the
  context of the calling thread.
 
 For one, you'd want a thread per virtqueue.

Don't understand why. The thread can be blocked on all four ends
of the device and wake up whenever there is some work to do in
any direction. If we have data to be transferred in both directions,
we save one thread switch. It probably won't hurt much to have one
thread per direction, but I don't see the point.
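
Something like this, where one poll() covers all four wakeup sources (the kick
fds are assumed to be eventfds; names are illustrative):

  #include <poll.h>

  static void wait_for_work(int tx_kick_efd, int rx_kick_efd, int backend_fd)
  {
          struct pollfd pfds[3] = {
                  { .fd = tx_kick_efd, .events = POLLIN },            /* guest queued TX */
                  { .fd = rx_kick_efd, .events = POLLIN },            /* guest posted RX buffers */
                  { .fd = backend_fd,  .events = POLLIN | POLLOUT },  /* backend readable/writable */
          };

          poll(pfds, 3, -1);
          /* ... then service whichever pfds[i].revents is set ... */
  }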

 Second, incoming traffic might arrive on another CPU, and we want
 to keep it local.  I guess you would also want ioctls to wake up
 the threads spuriously ...

Outbound traffic should just stay on whatever CPU was sending it
from the guest. For inbound traffic, we should only wake up
the thread on the CPU that got the data to start with.

Why would I wake up the threads spuriously? Do you mean for
stopping the transmission or something else? I guess a pthread_kill
would be enough for shutting it down.

  This would improve the driver on various fronts:
  
  - no need for playing tricks with use_mm/unuse_mm
  - possibly fewer global TLB flushes from switch_mm, which
may improve performance.
 
 Why would there be fewer flushes?

I just read up on task->active_mm handling. There probably wouldn't
be any. I got that wrong.

  - based on that, the ability to use any kind of file
descriptor that can do writev/readv or sendmsg/recvmsg
without the nastiness you mentioned.
 
 Yes, it's an interesting approach. As I said, we need to tread very
 carefully though; I don't think all issues are figured out. For example:
 what happens if we pass our own fd here? Will the refcount on the file ever get
 to 0 on exit?  There may be others ...

right.

  The disadvantage of course is that you need to add a user
  thread for each guest device to make up for the workqueue
  that you save.
 
 More importantly, you lose control of CPU locality.  Simply put, a
 natural threading model in virtualization is one thread per guest vcpu.
 Asking applications to add multiple helper threads just so they can
 block forever is wrong, IMO, as userspace has no idea which CPU
 they should be on, what priority to use, etc.

But the kernel also doesn't know this; you get the same problem in
another way. If you have multiple guests running at different priorities,
the kernel will use those priorities to do the more important transfers
first, while with a global workqueue every guest gets the same priority.

You say that the natural model is to have one thread per guest
CPU, but you have a thread per host CPU instead. If the numbers
are different, you probably lose either way. It gets worse if you
try to apply NUMA policies.
 
to
avoid some of the implications of kernel threads like the missing
ability to handle transfer errors in user space.
   
   Are you talking about TCP here?
   Transfer errors are typically asynchronous - possibly the eventfd
   I expose for vhost net is sufficient there.
  
  I mean errors in general if we allow random file descriptors to be used.
  E.g. tun_chr_aio_read could return EBADFD, EINVAL, EFAULT, ERESTARTSYS,
  EIO, EAGAIN and possibly others. We can handle some in kernel, others
  should never happen with vhost_net, but if something unexpected happens
  it would be nice to just bail out to user space.
 
 And note that there might be more than one error.  I guess that's
 another problem with trying to layer on top of vfs.

Why is that different from any other system call? We just return when
we hit the first error condition.

Arnd 


Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-16 Thread Michael S. Tsirkin
On Fri, Aug 14, 2009 at 01:40:36PM +0200, Arnd Bergmann wrote:
 On Thursday 13 August 2009, Michael S. Tsirkin wrote:
  What it is: vhost net is a character device that can be used to reduce
  the number of system calls involved in virtio networking.
  Existing virtio net code is used in the guest without modification.
 
 AFAICT, you have addressed all my comments, mostly by convincing me
 that you got it right anyway ;-).
 
 I hope this gets into 2.6.32, good work!
 
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 Acked-by: Arnd Bergmann a...@arndb.de
 
 One idea though:
 
  +     /* Parameter checking */
  +     if (sock->sk->sk_type != SOCK_RAW) {
  +             r = -ESOCKTNOSUPPORT;
  +             goto done;
  +     }
  +
  +     r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
  +                            &uaddr_len, 0);
  +     if (r)
  +             goto done;
  +
  +     if (uaddr.sa.sll_family != AF_PACKET) {
  +             r = -EPFNOSUPPORT;
  +             goto done;
  +     }
 
 You currently limit the scope of the driver by only allowing raw packet
 sockets to be passed into the network driver. In qemu, we currently support
 some very similar transports:
 
 * raw packet (not in a release yet)
 * tcp connection
 * UDP multicast
 * tap character device
 * VDE with Unix local sockets
 
 My primary interest right now is the tap support, but I think it would
 be interesting in general to allow different file descriptor types
 in vhost_net_set_socket. AFAICT, there are two major differences
 that we need to handle for this:
 
 * most of the transports are sockets, tap uses a character device.
   This could be dealt with by having both a struct socket * in
   struct vhost_net *and* a struct file *, or by always keeping the
   struct file and calling vfs_readv/vfs_writev for the data transport
   in both cases.

I am concerned that character devices might have weird side effects with
read/write operations and that calling them from a kernel thread the way I
do might have security implications. Can't point at anything specific
though at the moment.
I wonder - can we expose the underlying socket used by tap, or will that
create complex lifetime issues?

 * Each transport has a slightly different header, we have
   - raw ethernet frames (raw, udp multicast, tap)
   - 32-bit length + raw frames, possibly fragmented (tcp)
   - 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)
   To handle these three cases, we need either different ioctl numbers
   so that vhost_net can choose the right one, or a flags field in
   VHOST_NET_SET_SOCKET, like
 
   #define VHOST_NET_RAW   1
   #define VHOST_NET_LEN_HDR   2
   #define VHOST_NET_VNET_HDR  4
 
   struct vhost_net_socket {
   unsigned int flags;
   int fd;
   };
  #define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)

It seems we can query the socket to find out the type, or use the
features ioctl.

 If both of those are addressed, we can treat vhost_net as a generic
 way to do network handling in the kernel independent of the qemu
 model (raw, tap, ...) for it. 
 
 Your qemu patch would have to work differently, so instead of 
 
 qemu -net nic,vhost=eth0
 
 you would do the same as today with the raw packet socket extension
 
 qemu -net nic -net raw,ifname=eth0 
 
 Qemu could then automatically try to use vhost_net, if it's available
 in the kernel, or just fall back on software vlan otherwise.
 Does that make sense?
 
   Arnd 

I agree, long term it should be enabled automatically when possible.

-- 
MST


Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-14 Thread Arnd Bergmann
On Thursday 13 August 2009, Michael S. Tsirkin wrote:
 What it is: vhost net is a character device that can be used to reduce
 the number of system calls involved in virtio networking.
 Existing virtio net code is used in the guest without modification.

AFAICT, you have addressed all my comments, mostly by convincing me
that you got it right anyway ;-).

I hope this gets into 2.6.32, good work!

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

Acked-by: Arnd Bergmann a...@arndb.de

One idea though:

 +      /* Parameter checking */
 +      if (sock->sk->sk_type != SOCK_RAW) {
 +              r = -ESOCKTNOSUPPORT;
 +              goto done;
 +      }
 +
 +      r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
 +                             &uaddr_len, 0);
 +      if (r)
 +              goto done;
 +
 +      if (uaddr.sa.sll_family != AF_PACKET) {
 +              r = -EPFNOSUPPORT;
 +              goto done;
 +      }

You currently limit the scope of the driver by only allowing raw packet
sockets to be passed into the network driver. In qemu, we currently support
some very similar transports:

* raw packet (not in a release yet)
* tcp connection
* UDP multicast
* tap character device
* VDE with Unix local sockets

My primary interest right now is the tap support, but I think it would
be interesting in general to allow different file descriptor types
in vhost_net_set_socket. AFAICT, there are two major differences
that we need to handle for this:

* most of the transports are sockets, tap uses a character device.
  This could be dealt with by having both a struct socket * in
  struct vhost_net *and* a struct file *, or by always keeping the
  struct file and calling vfs_readv/vfs_writev for the data transport
  in both cases.

* Each transport has a slightly different header, we have
  - raw ethernet frames (raw, udp multicast, tap)
  - 32-bit length + raw frames, possibly fragmented (tcp)
  - 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)
  To handle these three cases, we need either different ioctl numbers
  so that vhost_net can choose the right one, or a flags field in
  VHOST_NET_SET_SOCKET, like

  #define VHOST_NET_RAW 1
  #define VHOST_NET_LEN_HDR 2
  #define VHOST_NET_VNET_HDR 4

  struct vhost_net_socket {
unsigned int flags;
int fd;
  };
  #define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)
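
  Usage from qemu would then be something like this sketch, which relies on the
  struct, flags and ioctl number proposed above (not a merged ABI):

    #include <sys/ioctl.h>
    /* struct vhost_net_socket, VHOST_NET_VNET_HDR and VHOST_NET_SET_SOCKET
     * as defined above */

    static int vhost_attach_tap(int vhost_fd, int tap_fd)
    {
            struct vhost_net_socket s = {
                    .flags = VHOST_NET_VNET_HDR,  /* tap opened with IFF_VNET_HDR */
                    .fd    = tap_fd,
            };
            return ioctl(vhost_fd, VHOST_NET_SET_SOCKET, &s);
    }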

If both of those are addressed, we can treat vhost_net as a generic
way to do network handling in the kernel independent of the qemu
model (raw, tap, ...) for it. 

Your qemu patch would have to work differently, so instead of 

qemu -net nic,vhost=eth0

you would do the same as today with the raw packet socket extension

qemu -net nic -net raw,ifname=eth0 

Qemu could then automatically try to use vhost_net, if it's available
in the kernel, or just fall back on software vlan otherwise.
Does that make sense?

Arnd 