Re: presentation at kvm forum and pagefaults

2019-10-31 Thread Michael S. Tsirkin
Regarding the presentation I gave at the KVM Forum
on page faults.

Two points:


1. Page faults are important not just for migration.
They also matter for performance features such as
AutoNUMA and transparent huge pages, since these rely on moving
pages around.
Migration can perhaps be solved by switching to software, but
that is not a good solution for NUMA and THP, since
at any given time some page is likely to be in the middle of a move.




2. For devices such as networking RX, the order in which buffers are
used *does not matter*.
Thus if a device gets a fault when attempting to store a packet into
a buffer's memory, it can simply retry with the next buffer in the queue.

This works because, normally, buffers can be used out of order by the device.

The faulted buffer will be reused once the driver notifies the
device that the page has been faulted in.

Note that buffers are processed by the driver in the order in which they
have been used, *not* the order in which they were put in the queue.  So
this will *not* cause any packet reordering for the driver.

Packets will only get dropped if all buffers are swapped
out, which should be rare with a large RX queue.
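
To make the retry idea concrete, here is a minimal device-side sketch.
It is illustrative only: rx_queue, rx_buf and try_store() are
hypothetical names standing in for a device's DMA write path, not taken
from any real driver or from the presentation.

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical device-side structures; not from any real driver. */
struct rx_buf {
	void  *addr;
	size_t len;
	bool   fault_pending;	/* store faulted; wait for driver re-post */
};

struct rx_queue {
	struct rx_buf *bufs;
	unsigned int   size, head, tail;
};

/* Stand-in for a DMA write that can fault while the page is not resident. */
static int try_store(struct rx_buf *buf, const void *pkt, size_t len)
{
	if (buf->fault_pending)
		return -EFAULT;
	memcpy(buf->addr, pkt, len);
	return 0;
}

/* On a fault, mark the buffer and retry with the next one.  Buffers are
 * completed in the order they are actually used, so the driver sees no
 * packet reordering.  Drop only when every buffer faults. */
static int device_store_packet(struct rx_queue *q, const void *pkt, size_t len)
{
	unsigned int i;

	for (i = q->head; i != q->tail; i = (i + 1) % q->size) {
		struct rx_buf *buf = &q->bufs[i];

		if (try_store(buf, pkt, len) == -EFAULT) {
			buf->fault_pending = true;	/* reuse after fault-in */
			continue;
		}
		buf->len = len;
		return 0;
	}
	return -ENOBUFS;	/* all buffers swapped out: packet dropped */
}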


As I said at the forum, a side buffer where up to X packets
can be stored temporarily is also possible. But with the above
it is no longer strictly required.


This conflicts with the IN_ORDER feature flag, so I guess we will have to
re-think that flag. If we do feel we need to salvage IN_ORDER as is,
maybe the device can complete the faulted buffer with length 0 and the
driver can re-post it later, but I am not sure about this, since
involving the VF driver seems inelegant.


-- 
MST



[PATCH v4] vhost: introduce mdev based hardware backend

2019-10-31 Thread Tiwei Bie
This patch introduces an mdev based hardware vhost backend.
This backend is built on top of the same abstraction used
in virtio-mdev and provides a generic vhost interface for
userspace to accelerate virtio devices in the guest.

This backend is implemented as an mdev device driver on top
of the same mdev device ops used in virtio-mdev, but using
a different mdev class id, and it will register the device
as a VFIO device for userspace to use. Userspace can set up
the IOMMU with the existing VFIO container/group APIs and
then get the device fd with the device name. After getting
the device fd of this device, userspace can use vhost ioctls
to set up the backend.
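
For illustration (not part of the patch), a minimal sketch of this flow
from userspace. The group number and mdev UUID below are placeholders,
and all error handling is omitted:

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>
#include <linux/vhost.h>

int main(void)
{
	uint64_t features;

	/* Standard VFIO container/group setup for the IOMMU. */
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/0", O_RDWR);	/* group number varies */

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* The device fd is looked up by the mdev device name (its UUID). */
	int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD,
			   "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001");

	/* The same fd is then driven with vhost ioctls. */
	ioctl(device, VHOST_SET_OWNER);
	ioctl(device, VHOST_GET_FEATURES, &features);
	return 0;
}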

Signed-off-by: Tiwei Bie 
---
This patch depends on below series:
https://lkml.org/lkml/2019/10/30/62

v3 -> v4:
- Rebase on top of virtio-mdev series v6;
- Some minor tweaks and improvements;

v2 -> v3:
- Fix the return value (Jason);
- Don't cache unnecessary information in vhost-mdev (Jason);
- Get rid of the memset in open (Jason);
- Add comments for VHOST_SET_MEM_TABLE, ... (Jason);
- Filter out unsupported features in vhost-mdev (Jason);
- Add _GET_DEVICE_ID ioctl (Jason);
- Add _GET_CONFIG/_SET_CONFIG ioctls (Jason);
- Drop _GET_QUEUE_NUM ioctl (Jason);
- Fix the copy-paste errors in _IOW/_IOR usage;
- Some minor fixes and improvements;

v1 -> v2:
- Replace _SET_STATE with _SET_STATUS (MST);
- Check status bits at each step (MST);
- Report the max ring size and max number of queues (MST);
- Add missing MODULE_DEVICE_TABLE (Jason);
- Only support the network backend w/o multiqueue for now;
- Some minor fixes and improvements;
- Rebase on top of virtio-mdev series v4;

RFC v4 -> v1:
- Implement vhost-mdev as a mdev device driver directly and
  connect it to VFIO container/group. (Jason);
- Pass ring addresses as GPAs/IOVAs in vhost-mdev to avoid
  meaningless HVA->GPA translations (Jason);

RFC v3 -> RFC v4:
- Build vhost-mdev on top of the same abstraction used by
  virtio-mdev (Jason);
- Introduce vhost fd and pass VFIO fd via SET_BACKEND ioctl (MST);

RFC v2 -> RFC v3:
- Reuse vhost's ioctls instead of inventing a VFIO regions/irqs
  based vhost protocol on top of vfio-mdev (Jason);

RFC v1 -> RFC v2:
- Introduce a new VFIO device type to build a vhost protocol
  on top of vfio-mdev;

 drivers/vfio/mdev/mdev_core.c    |  20 ++
 drivers/vfio/mdev/mdev_private.h |   1 +
 drivers/vhost/Kconfig            |  12 +
 drivers/vhost/Makefile           |   3 +
 drivers/vhost/mdev.c             | 556 +++
 include/linux/mdev.h             |   5 +
 include/uapi/linux/vhost.h       |  18 +
 include/uapi/linux/vhost_types.h |   8 +
 8 files changed, 623 insertions(+)
 create mode 100644 drivers/vhost/mdev.c

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 22ca589750d8..109dbac01a8f 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -96,6 +96,26 @@ mdev_get_virtio_ops(struct mdev_device *mdev)
 }
 EXPORT_SYMBOL(mdev_get_virtio_ops);
 
+/* Specify the vhost device ops for the mdev device; this
+ * must be called during the create() callback for a vhost mdev device.
+ */
+void mdev_set_vhost_ops(struct mdev_device *mdev,
+   const struct virtio_mdev_device_ops *vhost_ops)
+{
+   mdev_set_class(mdev, MDEV_CLASS_ID_VHOST);
+   mdev->vhost_ops = vhost_ops;
+}
+EXPORT_SYMBOL(mdev_set_vhost_ops);
+
+/* Get the vhost device ops for the mdev device. */
+const struct virtio_mdev_device_ops *
+mdev_get_vhost_ops(struct mdev_device *mdev)
+{
+   WARN_ON(mdev->class_id != MDEV_CLASS_ID_VHOST);
+   return mdev->vhost_ops;
+}
+EXPORT_SYMBOL(mdev_get_vhost_ops);
+
 struct device *mdev_dev(struct mdev_device *mdev)
 {
	return &mdev->dev;
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index 7b47890c34e7..5597c846e52f 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -40,6 +40,7 @@ struct mdev_device {
union {
const struct vfio_mdev_device_ops *vfio_ops;
const struct virtio_mdev_device_ops *virtio_ops;
+   const struct virtio_mdev_device_ops *vhost_ops;
};
 };
 
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 3d03ccbd1adc..062cada28f89 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -34,6 +34,18 @@ config VHOST_VSOCK
	To compile this driver as a module, choose M here: the module will be
	called vhost_vsock.
 
+config VHOST_MDEV
+   tristate "Vhost driver for Mediated devices"
+   depends on EVENTFD && VFIO && VFIO_MDEV
+   select VHOST
+   default n
+   ---help---
+   This kernel module can be loaded in host kernel to accelerate
+   guest virtio devices with the mediated device based backends.
+
+   To compile this driver as a module, choose M here: the module will
+   be called vhost_mdev.
+
 config VHOST
tristate
---help---
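
As context for the new API: per the comment on mdev_set_vhost_ops()
above, a parent driver is expected to call it from its create()
callback. A minimal hypothetical sketch (my_vhost_ops and my_create are
illustrative names, not part of this patch):

static const struct virtio_mdev_device_ops my_vhost_ops = {
	/* .set_vq_address, .set_vq_num, .kick_vq, ... supplied by the
	 * parent driver's hardware-specific implementation. */
};

static int my_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* Set the ops at create() time so the device is classed as
	 * MDEV_CLASS_ID_VHOST before a bus driver binds to it. */
	mdev_set_vhost_ops(mdev, &my_vhost_ops);
	return 0;
}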

Re: [PATCH net-next 11/14] vsock: add multi-transports support

2019-10-31 Thread Stefano Garzarella
On Wed, Oct 30, 2019 at 03:40:05PM +, Jorgen Hansen wrote:
> > From: Stefano Garzarella [mailto:sgarz...@redhat.com]
> > > +/* Assign a transport to a socket and call the .init transport callback.
> > > + *
> > > + * Note: for stream socket this must be called when vsk->remote_addr is set
> > > + * (e.g. during the connect() or when a connection request on a listener
> > > + * socket is received).
> > > + * The vsk->remote_addr is used to decide which transport to use:
> > > + *  - remote CID > VMADDR_CID_HOST will use host->guest transport
> > > + *  - remote CID <= VMADDR_CID_HOST will use guest->host transport
> > > + */
> > > +int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
> > > +{
> > > +	const struct vsock_transport *new_transport;
> > > +	struct sock *sk = sk_vsock(vsk);
> > > +
> > > +   switch (sk->sk_type) {
> > > +   case SOCK_DGRAM:
> > > +   new_transport = transport_dgram;
> > > +   break;
> > > +   case SOCK_STREAM:
> > > +   if (vsk->remote_addr.svm_cid > VMADDR_CID_HOST)
> > > +   new_transport = transport_h2g;
> > > +   else
> > > +   new_transport = transport_g2h;
> > 
> > I just noticed that this breaks loopback in the guest.
> > As a fix, we should use 'transport_g2h' when remote_cid <=
> > VMADDR_CID_HOST or when remote_cid is the CID of 'transport_g2h'.
> > 
> > To do that we also need to ensure that an L2 guest cannot have the same CID as L1.
> > For vhost_vsock I can call vsock_find_cid() in vhost_vsock_set_cid().
> > 
> > @Jorgen: do we need to do the same for vmci? Or is it guaranteed, since it
> > already supports nested VMs, that an L2 guest cannot have the same CID as
> > the L1?
> 
> As far as I can tell, we have the same issue with the current support for
> nested VMs in VMCI. If we have an L2 guest with the same CID as the L1
> guest, we will always send to the L2 guest, and we may assign an L2 guest
> the same CID as L1. It should be straightforward to avoid this, though.
> 

Yes, I think so.

For the v2 I'm exposing vsock_find_cid() to the transports; this way I
can reject requests that would give an L2 guest the same CID as L1. A
sketch of the resulting selection logic is below.
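
To make it concrete, a hedged sketch of the adjusted SOCK_STREAM case in
vsock_assign_transport(), assuming vsock_find_cid() returns true when
the CID belongs to a local (g2h) transport; its exact signature here is
an assumption:

	switch (sk->sk_type) {
	case SOCK_DGRAM:
		new_transport = transport_dgram;
		break;
	case SOCK_STREAM:
		if (vsk->remote_addr.svm_cid <= VMADDR_CID_HOST ||
		    vsock_find_cid(vsk->remote_addr.svm_cid))
			new_transport = transport_g2h;	/* includes guest loopback */
		else
			new_transport = transport_h2g;
		break;
	}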

Thanks,
Stefano