[virtio-dev] Re: [PATCH v2 0/6] Extend vhost-user to support VFIO based accelerators

2018-03-23 Thread Tiwei Bie
On Thu, Mar 22, 2018 at 04:55:39PM +0200, Michael S. Tsirkin wrote:
> On Mon, Mar 19, 2018 at 03:15:31PM +0800, Tiwei Bie wrote:
> > This patch set makes some small extensions to the vhost-user protocol
> > to support VFIO based accelerators, and makes it possible to get
> > performance similar to that of VFIO based PCI passthru while keeping
> > the virtio device emulation in QEMU.
> 
> I love your patches!
> Yet there are some things to improve.
> Posting comments separately as individual messages.
> 

Thank you so much! :-)

It may take me some time to address all your comments.
They're really helpful! I'll try to address and reply
to these comments in the next few days. Thanks again!
I do appreciate it!

Best regards,
Tiwei Bie

> 
> > How does the accelerator accelerate vhost (data path)
> > ======================================================
> > 
> > Any virtio ring compatible device can potentially be used as a
> > vhost data path accelerator. We can set up the accelerator based
> > on the information (e.g. memory table, features, ring info, etc.)
> > available on the vhost backend, and the accelerator will be able to
> > use the virtio ring provided by the virtio driver in the VM directly.
> > So the virtio driver in the VM can exchange e.g. network packets
> > with the accelerator directly via the virtio ring. That is to say,
> > we will be able to use the accelerator to accelerate the vhost
> > data path. We call it vDPA: vhost Data Path Acceleration.
> > 
> > Notice: although the accelerator can talk with the virtio driver
> > in the VM via the virtio ring directly, the control path events
> > (e.g. device start/stop) in the VM will still be trapped and handled
> > by QEMU, and QEMU will deliver such events to the vhost backend
> > via the standard vhost protocol.
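
To make the data/control path split above concrete, here is a rough,
purely illustrative C sketch of what a vhost-user backend might hand to
an accelerator once it has received the memory table and vring setup
over the standard vhost protocol. The accel_queue_config structure and
accel_enable_queue() function are hypothetical stand-ins for a
device-specific API, not part of the series.

#include <stdint.h>

/* Hypothetical per-queue configuration derived from vhost-user messages:
 * the ring addresses come from VHOST_USER_SET_VRING_ADDR (translated via
 * the memory table from VHOST_USER_SET_MEM_TABLE), and the two eventfds
 * come from VHOST_USER_SET_VRING_KICK / VHOST_USER_SET_VRING_CALL. */
struct accel_queue_config {
    uint64_t desc_addr;   /* descriptor table address */
    uint64_t avail_addr;  /* available ring address */
    uint64_t used_addr;   /* used ring address */
    uint16_t size;        /* number of descriptors */
    int kick_fd;          /* guest -> device notification (eventfd) */
    int call_fd;          /* device -> guest interrupt (eventfd) */
};

/* Hypothetical device-specific call: program the hardware so that it
 * processes the guest's virtio ring directly, with no software data
 * path in between. */
int accel_enable_queue(int dev_fd, unsigned int qid,
                       const struct accel_queue_config *cfg);
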
> > 
> > The link below is an example showing how to set up such an environment
> > via nested VMs. In this case, the virtio device in the outer VM is
> > the accelerator. It will be used to accelerate the virtio device
> > in the inner VM. In reality, we could use virtio ring compatible
> > hardware devices as the accelerators.
> > 
> > http://dpdk.org/ml/archives/dev/2017-December/085044.html
> > 
> > The above example doesn't require any changes to QEMU, but
> > it has lower performance compared with traditional VFIO
> > based PCI passthru. That's the problem this patch set wants
> > to solve.
> > 
> > The performance issue of vDPA/vhost-user and solutions
> > =======================================================
> > 
> > For a vhost-user backend, the critical issue in vDPA is that the
> > data path performance is relatively low and some host threads are
> > needed for the data path, because the mechanisms necessary to
> > support the following are missing:
> > 
> > 1) the guest driver notifying the device directly;
> > 2) the device interrupting the guest directly.
> > 
> > So this patch set makes some small extensions to the vhost-user
> > protocol to make both of them possible. It leverages the same
> > mechanisms (e.g. EPT and Posted-Interrupt on the Intel platform) as
> > PCI passthru.
> > 
> > A new protocol feature bit is added to negotiate the accelerator
> > feature support. Two new slave message types are added to control
> > the notify region and queue interrupt passthru for each queue.
> > From the viewpoint of vhost-user protocol design, it's very flexible:
> > the passthru can be enabled/disabled for each queue individually,
> > and it's possible to accelerate each queue with a different device.
> > More design and implementation details can be found in the last
> > patch.
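
As a rough illustration of the kind of per-queue slave payload described
above (the struct name, field layout and semantics here are only a
sketch; the actual message definitions and numbers are in the last patch
of the series):

#include <stdint.h>

/* Illustrative payload for a per-queue slave request that tells QEMU
 * where the accelerator's notify region for a given queue lives, so
 * QEMU can mmap it and expose it to the guest, letting guest notifies
 * reach the device directly. Not the definition used by the series. */
struct vhost_user_vring_area {
    uint64_t u64;     /* queue index (plus validity flags) */
    uint64_t size;    /* size of the notify region */
    uint64_t offset;  /* offset of the region within the attached fd */
};
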
> > 
> > Difference between vDPA and PCI passthru
> > ========================================
> > 
> > The key difference between PCI passthru and vDPA is that, in vDPA,
> > only the data path of the device (e.g. DMA ring, notify region and
> > queue interrupt) is passed through to the VM, while the device control
> > path (e.g. PCI configuration space and MMIO regions) is still
> > defined and emulated by QEMU.
> > 
> > The benefits of keeping the virtio device emulation in QEMU compared
> > with virtio device PCI passthru include (but are not limited to):
> > 
> > - consistent device interface for guest OS in the VM;
> > - max flexibility on the hardware (i.e. the accelerators) design;
> > - leveraging the existing virtio live-migration framework;
> > 
> > Why extend vhost-user for vDPA
> > ==============================
> > 
> > We have already implemented various virtual switches (e.g. OVS-DPDK)
> > based on vhost-user for VMs in the cloud. They are purely software
> > running on CPU cores. When we have accelerators for such NFVi
> > applications, it's ideal if the applications can keep using the
> > original interface (i.e. vhost-user netdev) with QEMU, and the
> > infrastructure is able to decide when and how to switch between the
> > CPU and the accelerators within that interface. The switching can
> > then be done flexibly and quickly inside the applications.
> > 
> > More d

[virtio-dev] RE: [PATCH 1/4] iommu: Add virtio-iommu driver

2018-03-23 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Thursday, March 22, 2018 6:06 PM
> 
> > From: Robin Murphy [mailto:robin.mur...@arm.com]
> > Sent: Wednesday, March 21, 2018 10:24 PM
> >
> > On 21/03/18 13:14, Jean-Philippe Brucker wrote:
> > > On 21/03/18 06:43, Tian, Kevin wrote:
> > > [...]
> > >>> +
> > >>> +#include 
> > >>> +
> > >>> +#define MSI_IOVA_BASE  0x8000000
> > >>> +#define MSI_IOVA_LENGTH 0x100000
> > >>
> > >> this is ARM specific, and according to the virtio-iommu spec isn't it
> > >> better to probe it on the endpoint instead of hard-coding it here?
> > >
> > > These values are arbitrary, not really ARM-specific even if ARM is the
> > > only user yet: we're just reserving a random IOVA region for mapping
> > > MSIs. It is hard-coded because of the way iommu-dma.c works, but I
> > > don't quite remember why that allocation isn't dynamic.
> >
> > The host kernel needs to have *some* MSI region in place before the
> > guest can start configuring interrupts, otherwise it won't know what
> > address to give to the underlying hardware. However, as soon as the host
> > kernel has picked a region, host userspace needs to know that it can no
> > longer use addresses in that region for DMA-able guest memory. It's a
> > lot easier when the address is fixed in hardware and the host userspace
> > will never be stupid enough to try and VFIO_IOMMU_DMA_MAP it, but in
> > the more general case where MSI writes undergo IOMMU address translation
> > so it's an arbitrary IOVA, this has the potential to conflict with stuff
> > like guest memory hotplug.
> >
> > What we currently have is just the simplest option, with the host kernel
> > just picking something up-front and pretending to host userspace that
> > it's a fixed hardware address. There's certainly scope for it to be a
> > bit more dynamic in the sense of adding an interface to let userspace
> > move it around (before attaching any devices, at least), but I don't
> > think it's feasible for the host kernel to second-guess userspace enough
> > to make it entirely transparent like it is in the DMA API domain case.
> >
> > Of course, that's all assuming the host itself is using a virtio-iommu
> > (e.g. in a nested virt or emulation scenario). When it's purely within a
> > guest then an MSI reservation shouldn't matter so much, since the guest
> > won't be anywhere near the real hardware configuration anyway.
> >
> > Robin.
> 
> Curious: since we are defining a new iommu architecture anyway,
> is it possible to avoid that ARM-specific burden completely?
> 

OK, after some study of those tricks, here is what I've learned:

- The MSI_IOVA window is used only on request (iommu_dma_get_msi_page),
and is not meant to take effect on all architectures once initialized,
e.g. the ARM GIC uses it but x86 does not. So it is reasonable for the
virtio-iommu driver to implement such a capability (see the sketch
after this list);

- I wondered whether the hardware MSI doorbell could always be reported
on virtio-iommu, since it's newly defined. It looks like there is a
problem if the underlying IOMMU uses the sw-managed MSI style - a valid
mapping is expected at every level of translation, meaning the guest has
to manage the stage-1 mapping in a nested configuration since stage-1 is
owned by the guest.
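
As a reference for the first point, here is a sketch of reporting such a
window through the IOMMU core's reserved-region interface. It mirrors
what existing drivers like arm-smmu do; the function name
viommu_get_resv_regions is only assumed to be what a virtio-iommu driver
would hook into its iommu_ops.

#include <linux/iommu.h>
#include <linux/list.h>

/* Values taken from the quoted patch above. */
#define MSI_IOVA_BASE		0x8000000
#define MSI_IOVA_LENGTH		0x100000

/* Sketch only: expose a software-managed MSI IOVA window so that
 * iommu_dma_get_msi_page() has a region to allocate doorbell mappings
 * from when a driver requests it. */
static void viommu_get_resv_regions(struct device *dev,
				    struct list_head *head)
{
	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
	struct iommu_resv_region *region;

	region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
					 prot, IOMMU_RESV_SW_MSI);
	if (!region)
		return;

	list_add_tail(&region->list, head);
}
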

Then virtio-iommu is naturally expected to report the same MSI
model as the one supported by the underlying hardware. Below are some
further thoughts along this route (using 'IOMMU' to represent the
physical one and 'virtio-iommu' for the virtual one):



In the scope of the current virtio-iommu spec (v0.6), there is no nested
consideration yet. The guest driver is expected to use the MAP/UNMAP
interface on assigned endpoints. In this case the MAP requests
(IOVA->GPA) are caught and maintained within Qemu, which then
talks to VFIO to map IOVA->HPA in the IOMMU.
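
To illustrate that last step, a minimal sketch of how Qemu's VFIO
backend could turn such a request into a host mapping follows;
gpa_to_hva() is a hypothetical helper standing in for Qemu's guest
memory lookup, and container_fd is the VFIO container file descriptor.

#include <sys/ioctl.h>
#include <linux/vfio.h>
#include <stdint.h>

/* Hypothetical helper: translate a guest physical address to the host
 * virtual address backing it. */
extern void *gpa_to_hva(uint64_t gpa);

/* Sketch: forward a virtio-iommu MAP (IOVA -> GPA) to the host IOMMU.
 * The GPA is first turned into a host virtual address, then VFIO maps
 * IOVA -> HPA via the container. */
static int map_iova_to_host(int container_fd, uint64_t iova,
                            uint64_t gpa, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uint64_t)(uintptr_t)gpa_to_hva(gpa),
        .iova  = iova,
        .size  = size,
    };

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
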

Qemu can learn the MSI model of the IOMMU from sysfs.
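
For example (a sketch, assuming the per-group reserved_regions attribute
exposed by the IOMMU core; the group number and output format are
simplified here), reading an IOMMU group's reserved regions might look
like:

#include <stdio.h>

/* Sketch: list the reserved regions (and their type, e.g. "msi" vs
 * "reserved") of IOMMU group 0 as exported by sysfs. */
int main(void)
{
    FILE *f = fopen("/sys/kernel/iommu_groups/0/reserved_regions", "r");
    unsigned long long start, end;
    char type[32];

    if (!f)
        return 1;

    while (fscanf(f, "%llx %llx %31s", &start, &end, type) == 3)
        printf("reserved region 0x%llx-0x%llx type %s\n", start, end, type);

    fclose(f);
    return 0;
}
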

For a hardware MSI doorbell (x86 and some ARM):
* Host kernel reports it to Qemu as IOMMU_RESV_MSI
* Qemu reports it to the guest as VIRTIO_IOMMU_RESV_MEM_T_MSI
* Guest takes the range as IOMMU_RESV_MSI, i.e. reserved
* Qemu's MAP database has no mapping for the doorbell
* Physical IOMMU page table has no mapping for the doorbell
* MSIs from passthrough devices bypass the IOMMU
* MSIs from emulated devices bypass virtio-iommu

For a software MSI doorbell (most ARM):
* Host kernel reports it to Qemu as IOMMU_RESV_SW_MSI
* Qemu reports it to the guest as VIRTIO_IOMMU_RESV_MEM_T_RESERVED
* Guest takes the range as IOMMU_RESV_RESERVED
* vGIC requests a mapping of the 'GPA of the virtual doorbell'
  * a map request (IOVA->GPA) is sent on the endpoint
  * Qemu maintains the mapping in its MAP database
  * but no VFIO_MAP request is issued, since it's purely virtual
* GIC requests a mapping of the 'HPA of the physical doorbell'
  * e.g. triggered by VFIO enabling MSI
  * the physical IOMMU now includes a valid mapping (IOVA->HPA)
* MSIs from emulated devices go through the Qemu MAP
  database (IOVA->'GPA of virtual doorbell') and then hit the vGIC
* MSIs from passthrough devices go through the IOMMU
  (IOVA->'HPA of physical doorbell') and then hit the GIC

In this case, host doorbel